Preface


This beginning textbook in statistical methods has been written 
to meet the needs of undergraduate college students who are 
concentrating in sociology and related subjects. In the choice 
of methods, in the character of the illustrative data and problems, 
and in emphasis throughout, it differs from the texts in eco- 
nomic or educational statistics that have generally been used 
by such students. 

The chief purpose has been to provide students who expect to 
become professional sociologists with the necessary groundwork 
for more advanced training in quantitative research methods. 
Familiarity with the topics included, however, should enable 
those who take no further courses in statistics to understand 
most of the statistical studies and references that now appear 
in the sociological journals and literature. Nonprofessional 
students who go through the course should learn to appreciate 
some of the difficulties involved in the study of social problems, 
and to be more wary of careless and prejudiced thinking in this 
field; for mathematical statistics represents a rigorous form of 
applied logic. 

Unfortunately, most students who elect to specialize in soci- 
ology have no mathematical training beyond high school algebra. 
This fact has compelled the omission of mathematical deriva- 
tions, with the exception of a few very simple ones. As a 
substitute, an attempt has been made to point out assumptions 
that should be watched in using the various formulas. Students 
who plan to go on in the subject, however, should begin at once 
to build up an adequate mathematical background. 

Because of its complications and as yet very infrequent use in 
sociological research, small-sampling theory has for the most part 
been omitted from this elementary treatment. 

The amount of material covered is more than enough for a 
semester’s work with an average class, so that some selection of 
topics is possible for the instructor. Under certain circumstances,
it may be advisable to omit the less easy sections of
chapters IX, XI, XII, XIII, and XIV. 

Constant practice in working statistical problems is indispensa- 
ble for mastery of the subject. The problems given at the end 
of each chapter are intended to be only suggestive; they should 
be greatly multiplied for laboratory purposes. 

Thanks are due Professor E. A. Gaumnitz of the University 
of Wisconsin, who has read the manuscript and made helpful 
suggestions, and Mr. Robert J. Hader, who has eliminated 
numerous minor errors. 

Special acknowledgment is made of permission by Prof. R. A. 
Fisher and his publishers, Oliver & Boyd, Edinburgh, to use the 
Table of Chi-square and the Table of Values of the Correlation 
Coefficient for Different Levels of Significance, which appear as 
Tables 2 and 4 in the Appendix of this book. Many other 
publishers have been kind enough to grant permission to use 
tables and material, specific acknowledgment of which has been 
made in place. 

Thomas C. McCormick.

Madison, Wis.,

August, 1941.

PART I


Statistics in Social Research

CHAPTER I
INTRODUCTORY 


1. The Origins of Statistics.— The word statistics was used in 
Great Britain and Statistik in Germany as early as the eighteenth 
century to refer to collections of information of any kind about a 
state (state-istics). As time passed, “statistics” came to be 
limited to quantitative data or figures on wealth, taxes, marriages, 
baptisms, deaths, and the like. Distinguished pioneers in the 
field were the Germans, Achenwall and Büsching. Modern 
agencies representing this type of statistics are the census bureaus 
of the United States and other nations. 

Mathematical statistics, a branch of mathematical theory,
originated in investigations of birth and death rates and in
efforts to solve problems growing out of games of chance. Among
the great early vital statisticians were Graunt and Petty of
England and Süssmilch of Germany. The fundamentals of
the theory of probability were developed from the seventeenth
to the nineteenth century by such eminent mathematicians as
Pascal, Bernoulli, de Moivre, Laplace, and Gauss.

Elementary mathematical statistics was popularized in the 
nineteenth century by the Belgian, Quetelet, who applied it to a 
wide variety of topics, including physical anthropology and 
crime. He is sometimes called the father of social statistics, as
the extension of statistics to sociological problems may be
termed.

A rapid expansion in mathematical statistics and its use in
science occurred in England during the first quarter of the
present century through the work of Karl Pearson, following
earlier efforts by Sir Francis Galton. These two men were
biologists, and Pearson was a mathematician as well. As a
result of this phase, modern statistical methods bear the imprint
of adaptation to biological data.

Mathematical statistics has gradually become a major method
of research in the fields of agriculture, biology, educational
psychology, psychology, geography, and physical anthropology.
Among the social sciences, education, economics, and social 
psychology at present lead in the proportion of statistical studies 
published, with sociology fourth and political science fifth. 
Statistical analysis is still rare in cultural anthropology and 
history. In a different direction, statistics is finding application 
in mathematical physics, engineering, and medicine. 

In this book, we shall be interested in elementary statistical 
methods only as tools of investigation in sociology and related 
social sciences. 

2. Quality and Quantity.—A qualitative difference implies a 
difference in nature, such as we recognize between a family and a 
church. A quantitative difference refers to a variation in 
amount between two or more instances of the same quality: for 
example, an intelligence quotient (I.Q.) of 112 is 14 units greater 
than an intelligence quotient of 98. 

Different qualities must be compared in terms of common 
subqualities (common denominators), as a Presbyterian family
and a Methodist church, a family of five members and a church 
of 500 members. A pure quality, moreover, can vary only in 
amount. It follows that all comparison must consist in noting 
what qualities are and are not common to A and B, and how 
each common quality varies in amount from A to B. The city 
and the country may be compared in terms of common qualities 
like population density, birth rate, death rate, incidence of 
tuberculosis, intelligence quotients, honesty, and so on; but in 
each instance the difference must be in terms of amount. Thus 
the birth rate of the city is 15 per 1,000, that of the country is 
22 per 1,000; and country people are believed to be more honest 
than city people. The last judgment is no less quantitative in 
nature because it is impressionistic and rough. 

Because comparison is basic to knowledge, and quantitative 
judgments are inseparable from comparison, quantitative judg- 
ments are unavoidable in science. It is thus easy to understand 
why scientists have gradually developed more and more system- 
atic and reliable ways of making quantitative judgments, such 
as we have in the many branches of mathematics, including 
mathematical statistics. 

3. Statistics, the Method of Probabilities.—The questions in 
which social scientists are interested do not have exact or certain
quantitative answers. For example, if we ask what is the
relation between the occurrence of divorce and the presence of 
children in the home, we find that divorce occurs both among 
couples with children and among couples without children, 
but relatively more often in the case of the latter. We cannot 
say that divorce takes place only when there are no children; but 
we can say that divorce is reported in so many childless marriages 
per 1,000, and in so many fertile marriages per 1,000. Or, 
expressing it a little differently, we can say that the chances of 
divorce are X in 1,000 in the case of a childless couple, and Y 
in 1,000 in the case of a fertile couple. 

Statistical methods are specially designed for the analysis 
of quantitative¹ data like those above that result from many
causes, some or all of which cannot be completely controlled. 
Outside the scientific laboratory, and even in much laboratory 
research, adequate control over all factors is out of the question. 
For this reason, the statistical method has general application. 

1 By quantitative data are meant data that can be measured or counted,
as discussed below in Chap. II.

Mathematical statistics is a direct logical extension to practical
situations of the exact quantitative methods used in the laboratory
experiments of the physical sciences. When precise
measurement and complete control over all factors are possible,
a mathematical equation can be set up from which the value of
a dependent factor, Y, can be estimated exactly for any given
value of an independent factor, X. For example, if we know
the distance, X, of an object from the ground, we can calculate
from the law of falling bodies the time, Y, it will take for the
object to fall in a vacuum. When we actually drop an object
under these controlled conditions, a stop watch will always
register the length of time predicted by the equation. The
likelihood that the period observed in any competent repetition
of the experiment will be that computed from the equation is
certainty.

If, however, the object happens to be a feather which is
dropped under ordinary atmospheric conditions rather than
in a vacuum, the situation is different. In proportion as the
factors are uncontrolled or unknown, the stop watch will no
longer register the time predicted by the law of falling bodies.
Nevertheless, if a large number of experiments are made by
dropping the feather under ordinary atmospheric conditions
from the same distance, the time of falling will be found to vary 
around some average time, being sometimes more and sometimes 
less. Similarly, the average time of falling from other distances, 
X, can be found, and an equation worked out from which the 
average time of falling, Y, can be estimated for any distance, X. 
Then by studying the varying time required for the object to 
fall a given distance, it may be established that in say two-thirds 
of the trials the time does not vary from the average time, say 
2 sec., by more than say 0.1 sec. This enables us to make a 
prediction from our empirical equation. We can say that if our 
feather is dropped under ordinary atmospheric conditions from 
a given height, the time required to fall will, two out of three 
times, in the long run, vary from an estimated average of 2 sec. 
by not more than 0.1 sec. in either direction. That is, in two 
out of three trials, the time of falling will be between 1.9 and 
2.1 sec. 
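
The feather example can be imitated numerically. The sketch below is an illustration only. The average time of 2 sec. and the margin of 0.1 sec. come from the text; the bell-shaped (normal) scatter of the trials, the fixed seed, and the function names are assumptions made for the example.

```python
import random
import statistics


def simulate_fall_times(n_trials, mean_time=2.0, spread=0.1, seed=0):
    """Return n_trials simulated falling times in seconds.

    Assumption: the times scatter normally around mean_time with standard
    deviation `spread`. The text does not specify the form of the scatter.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean_time, spread) for _ in range(n_trials)]


def share_within(times, center, tolerance):
    """Proportion of the times lying within `tolerance` of `center`."""
    hits = sum(1 for t in times if abs(t - center) <= tolerance)
    return hits / len(times)


times = simulate_fall_times(10_000)
average = statistics.mean(times)
print(f"average time: {average:.2f} sec.")
print(f"share within 0.1 sec.: {share_within(times, average, 0.1):.2f}")
```

Under these assumptions the printed share should come out at roughly two-thirds, which is the kind of statement the empirical equation permits: not the time of any one fall, but the proportion of falls expected inside a stated margin.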

This is, broadly speaking, the kind of estimate that mathe- 
matical statistics furnishes in the social sciences. In essence it is 
always a calculation of probabilities. The “pure” mathematical 
formula of the laboratory is merely a special case of the statistical 
equation, being the limit that the latter approaches as the amount 
of control and precision of measurement are increased. If 
sociological data could be exactly controlled and measured, the 
element of probability would disappear, and the statistical 
equation would become a precise one like the law of falling 
bodies. 

4. Representative Data.—Most sociological studies, statistical 
or otherwise, deal with samples rather than with complete data. 
If farm life in a given state is to be investigated, certain farms 
are taken as a sample to represent all the farms in the state 
regarded as the universe. The essential requirements of a good
sample are that every item in the universe from which the sample 
is drawn shall have an equal chance of being included in the 
sample, and the sample must be large enough to include every 
kind of item in the universe in something like the correct propor- 
tions. The proper size of the sample depends somewhat on how 
much the items in the universe vary among themselves. Poor 
samples that include items from outside the universe they are 
intended to represent, that omit important elements of the
universe, or that include elements of the universe in the wrong
proportions, are a fertile source of false conclusions in social 
research. A large part of mathematical statistics deals with 
the problems of sampling. 
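
The two requirements of a good sample, equal chance of inclusion and adequate size, can be tried out on a made-up universe. The sketch below is hedged accordingly: the 1,000 farms, the 700 to 300 split between owners and tenants, the sample sizes, and the function names are all hypothetical and serve only to show the idea.

```python
import random
from collections import Counter


def simple_random_sample(universe, size, seed=0):
    """Draw `size` items without replacement, every item equally likely."""
    rng = random.Random(seed)
    return rng.sample(universe, size)


def proportions(items):
    """Share of each kind of item, as a dict of kind -> proportion."""
    counts = Counter(items)
    total = len(items)
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical universe: 1,000 farms, 700 owner-operated, 300 tenant-operated.
farms = ["owner"] * 700 + ["tenant"] * 300

# Larger samples tend to reproduce the proportions of the universe more closely.
for size in (10, 100, 500):
    sample = simple_random_sample(farms, size)
    print(size, proportions(sample))
```

A very small sample may miss a kind of item altogether or badly distort its share; the larger samples should fall close to 0.7 and 0.3.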

5. Statistics and the Individual.—It is commonly thought that 
statistics cannot deal with the individual, but must confine itself
to group averages. There is really nothing to prevent a statis- 
tical investigation of an individual. An individual may be 
readily analyzed into factors or units of various kinds, and the 
relationships of these to other factors in the same personality 
and in the environment can be studied by the same methods 
that are now used in studying groups of individuals. As a
matter of economy, however, society will seldom want to subject 
individuals to scientific study except as types, which, of course, 
lead back to group averages. 

6. Interpretation of Statistical Results.—Statistics employs 
figures and mathematical symbols that represent definite factors 
in a particular problem. In interpreting statistical results, 
therefore, care must be taken that each symbol is given the 
same meaning that was assigned to it at the beginning of the
problem, and to which no important exceptions were allowed
during the study. 

It is sometimes puzzling to understand the reasons for a 
statistical fact, and offhand explanations may be found at the
end of even careful studies. But if the original study was not 
sufficiently inclusive to clarify some point of interest, its reliable 
explanation can consistently come only from further research. 
For example, if an investigation discovers that a larger proportion 
of women are married in cities where the number of men exceeds 
the number of women than in cities where the two sexes are 
equal in number, or where women outnumber men, one may
speculate that this is because men do the proposing. It should be
made clear, however, that such an explanation is only a plausible
“hunch,” which should be tested if it is considered of enough
importance.

Difficulty may also be experienced in interpreting just what
certain statistical concepts mean, e.g., correlation coefficients,
averages, or tests of statistical significance. The only help
here is a clearer understanding of statistical methods, and
especially of the mathematical assumptions that underlie them.

7. Statistics Not a Mechanical Method.—Although the statistical
method allows data to be treated by systematic and
standardized techniques, it is a serious mistake to suppose that 
it is a mechanical method that may be substituted for hard and 
original thinking. On the contrary, mathematical statistics is 
merely a set of powerful logical tools that call for a high type of 
judgment and skill for their successful use. The statistical 
investigator must know what techniques are valid and effective 
for a given problem, and when quantitative methods are not 
appropriate at all. He needs insight to select worthwhile 
problems, and intimate knowledge of the data to interpret his 
findings, no less than does any other type of investigator. 

8. Simplicity the Ideal.—The experienced statistician always 
prefers simple to complex methods, when the two are equally 
effective. The beginner will do well not to yield to the tempta- 
tion to depart from this sensible rule. 


Exercises 


1. Briefly summarize the history of statistics and the extent of its 
use as a method of research.

2. Distinguish between quality and quantity. Illustrate. 

3. Can you find an exception to the proposition that all comparison 
is quantitative? 

4. To what general kind of research situation is statistics appropriate, 
and why? Illustrate. 

5. What is the relationship of the statistical equation to the mathe- 
matical “law” of physics? 

6. a. How exactly can predictions be made by means of statistical 
methods? 

b. How serious a handicap does this impose on social research?

7. What is the likelihood in the field of social research that the statis- 
tical method will some day be replaced by exact mathematical formulas 
like those of physics? Explain. 

8. Comment briefly on the following published statement: 

“Jobless Survey to Bare Truth, C. C. Head Says: Pres. George Davis
of the Chamber of Commerce of the United States said Saturday an 
impartial survey of the employable jobless would show their numbers 
had been exaggerated and disprove alleged needs for spreading work 
by reducing working hours. 

“He said the chamber recently employed a statistical agency to 
make a sample survey of 100 relief recipients in a representative city 
of more than 100,000 population. The names of 50 men and 50 women
were picked at random from Federal and local governmental relief rolls
in the city. 

“The survey showed, he said, that 44 out of the 100 never had been 
employed in private business. Seventeen were over 70 years and 82 
never had a bank or savings account. 

“He says the figures point out that the greater number of those 
labeled as unemployed could not or would not work in private industry 
even if jobs were available.” 

9. Give an example of representative and unrepresentative, adequate 
and inadequate sampling that might occur, or has occurred, in social 
research.

CHAPTER II
THE QUANTIFICATION OF SOCIAL DATA 


1. Definition and Counting.—The methods of statistics are 
applicable only to data that can be expressed in some kind of 
countable units. Any event or quality that can be recognized 
can be counted. If we know a happy marriage when we see one, 
we can count the happy marriages in a sample of marriages. 
Nothing simpler can be done to a concept than to count how 
many times instances of it occur. If the concept is not suffi- 
ciently recognizable for its instances to be counted, one may 
fairly assume that it is not yet ready for any kind of scientific 
manipulation, except attempts to arrive at a more reliable
definition. 

2. Classification.—If a concept (e.g., “conflict behavior”) can 
be broken down into two or more subcategories (e.g., “war,”
“revolution,” etc.) that can be defined well enough to be told
apart, its cases can be classified. Classification makes possible 
the counting of instances in each class, which may then serve as a 
basis for considerable statistical analysis. We have simple 
classification whenever data are sorted into categories that are 
entirely unordered with respect to amount. For example, we
may classify our acquaintances as religious and nonreligious; 
we may classify Americans as native white of native parentage, 
native white of foreign parentage, foreign born, and so on. 
Data may also be classified with respect to two or more criteria 
at a time, as married couples by occupation of husband, by 
income, and by number of children. The points to watch in 
classification are careful, objective definition of the several 
categories in terms of criteria that can be recognized in the 
instances to be classified, and independent reclassification of 
the instances by other competent investigators, to determine 
the reliability of the classification. Logically, any classification 
should be based on the same criterion throughout. Thus, 
it would not do to classify some of the foreign born as Catholics
or Protestants, and the rest as Italians, Jews, Germans, and so on.
Also, any classification should be totally inclusive of the class
defined and exclusive of all other classes. That is, if we are
dealing with all the foreign born in the United States, the 
Rumanians should not be omitted, nor should the American 
Indians be included. 
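
Counting by class, cross-classification, and the check by independent reclassification can each be put in a few lines. The sketch below is offered as an illustration only: the records of couples, the two investigators' lists, and the function names are hypothetical, and simple percentage agreement is just one assumed way of expressing the reliability of a classification.

```python
from collections import Counter


def tally(cases, *criteria):
    """Count the cases in each cell of a classification by one or more criteria."""
    return Counter(tuple(case[c] for c in criteria) for case in cases)


def agreement(first, second):
    """Proportion of cases that two investigators placed in the same class."""
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / len(first)


# Hypothetical records of married couples.
couples = [
    {"occupation": "farmer", "children": 3},
    {"occupation": "farmer", "children": 0},
    {"occupation": "clerk", "children": 0},
    {"occupation": "clerk", "children": 0},
]
print(tally(couples, "occupation"))              # one criterion
print(tally(couples, "occupation", "children"))  # two criteria at a time

# Independent classification of the same five cases by two investigators.
judge_1 = ["religious", "nonreligious", "religious", "religious", "nonreligious"]
judge_2 = ["religious", "nonreligious", "nonreligious", "religious", "nonreligious"]
print(agreement(judge_1, judge_2))  # 4 of the 5 cases agree
```

Because every case falls in exactly one cell, the counts in any such table add up to the number of cases, which is the arithmetic counterpart of a classification that is totally inclusive and mutually exclusive.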

3. Measurement of Amount.—The fact that any quality, such
as happiness in marriage, varies in degree, sooner or later forces 
the sociologist to go beyond the mere counting of instances, and 
to attempt to measure the intensity of the quality in a given 
instance or set of instances. For example, we may score the 
answers of married couples to a questionnaire and may regard 
the score of any couple as an index of the amount of happiness 
that they derive from their relationship. The central problem 
is, again, to find a unit in terms of which at least the relative 
amount of the quality can be measured. This is seldom easy to 
do, and must usually be approached through the devices of 
ranking, rating, or scoring. 

4. Ranking.—Ranking, or the arrangement of the instances
of a quality in order of amount, has been called the most elementary
form of measurement. We consider person A more
cooperative than person B, B more cooperative than C, and so on.
To increase the reliability of these judgments, the ranking may
be done independently by several qualified judges, and the
average ranks taken. Greater accuracy is sometimes obtained
by ranking each item with respect to every other item, i.e., by
all possible pairs. Where qualified and careful judges cannot be
obtained, ranking should not be used. As soon as the instances
of a quality are ranked, they become capable of a fair amount of
statistical treatment, including rank correlation.²
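
The two devices just mentioned, averaging the ranks given by several judges and ranking by all possible pairs, may be sketched as follows. The judges, the persons A, B, and C, and the function names are hypothetical, and counting the number of pairs each person "wins" is only one assumed way of turning paired judgments into an order.

```python
def average_ranks(rankings):
    """Average the rank each person receives from several judges.

    `rankings` is a list of dicts, one per judge, mapping person -> rank
    (1 = most cooperative).
    """
    people = rankings[0].keys()
    return {p: sum(r[p] for r in rankings) / len(rankings) for p in people}


def rank_from_pairs(preferences):
    """Order persons by the number of paired comparisons each one wins.

    `preferences` maps each pair (x, y) to whichever of the two was judged
    to have more of the quality.
    """
    wins = {}
    for (x, y), winner in preferences.items():
        wins.setdefault(x, 0)
        wins.setdefault(y, 0)
        wins[winner] += 1
    return sorted(wins, key=lambda person: -wins[person])


# Three hypothetical judges rank persons A, B, and C on cooperativeness.
judges = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 2, "C": 3},
]
print(average_ranks(judges))

# Every possible pair, judged one pair at a time.
pairs = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"}
print(rank_from_pairs(pairs))  # ['A', 'B', 'C']
```

With three persons there are only three pairs, but the number of pairs grows rapidly with the number of items, which is one reason the paired method is laborious.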

5. Rating.—Similar to ranking is rating, or the classification
of items into ascending, or ordered, classes. There are usually
three to seven of these classes. An odd number allows for a
median class, which is desirable. Thus psychiatrists may rate
persons in terms of their intelligence as Mentally Defective,
Slow-dull, Slow, Average, Fairly Intelligent, Distinctly Capable,
and Very Able. Classification of instances into categories like
these should be done independently by two or more persons as a
check. If there is good agreement in the placing of individual
instances, the percentages of the instances put in each given
category by several judges may then be averaged to improve the
accuracy. Self-ratings may be used, as well as ratings by others.

1 See “classification” in any text in logic, e.g., E. A. Burtt, Principles and
Problems of Right Thinking, pp. 162-164, Harper & Brothers, New York,
1928.

2 See Chap. X, Sec. 8.

6. Scoring.—In the case of most score cards, the experimenter 
decides impressionistically what subscore, usually a percentage, 
should be given to each aspect of a variable (e.g., the socio- 
economic status of a home). In other cases, a subscore is 
determined by counting the number of a certain item present 
in each instance (e.g., books in a home), or by measurement in 
the stricter sense (e.g., annual family income in dollars). The 
total score is the sum of the subscores on the different items 
included in the card. Usually the equality of the units, the 
placing of the zero point, the weightings, and the meaning of the 
total score are open to question; but in any case the total score 
represents a series of accumulated judgments reduced to a
numerical common denominator. Scoring devices may be quite
elaborate, as may be seen by inspecting Chapin’s living room
“scale” for scoring the socioeconomic status of a home, the 
Stanford-Binet intelligence test, or score cards for, say, dairy 
cattle used in judging contests at livestock shows. To show 
that they are parts of the same or associated things, the score 
on each item included on a score card should as a rule be high 
when the total score on the card is high, low when the latter is
low. The theory of the score card is that the total score is an 
index or function of (varies with) the amount of the quality 
it is attempting to measure. Part of a living room score card 
designed by F. Stuart Chapin to measure the socioeconomic 
status of American homes is reproduced below. 


CHAPIN’S SCALE FOR RATING LIVING ROOM EQUIPMENT¹

1 F. Stuart Chapin, Scale for Rating Living Room Equipment, American
Journal of Sociology, Vol. 37, pp. 583, 584, 1932.

DIRECTIONS TO VISITOR
1. The following list of items is for the guidance of the recorder. Not
all of the features listed will be found in any one home. Entries on the
schedules should, however, follow the order and numbering indicated.
Weights appear after the names of the respective items. Disregard
these weights in recording. Only when the list is finally checked
should the individual items be multiplied by these weights and the
sum of the weighted score be computed, and then only after leaving the
home. All information is confidential.

2. Check or underline the articles or items present. If more than 
one, write 2, 3, or 4, as the case may be. 

3. Do not enter the score of any article or feature present. Com- 
plete recording before attempting to enter scores. 

4. In cases where the family has no real living room, but uses the 
room at nights as a bedroom, or during the day as a kitchen or as a 
dining room, or as both, in addition to use of room as the chief gathering 
place of the family, please note this fact clearly and describe for what 
purposes the room is used. 

5. When possible, it is desirable to have a living room checked twice.
This may be done in either of two ways.

a. After an interval of two or three weeks the same visitor may
recheck the room. The first schedule should be marked I, the
second II.

b. After an interval or simultaneously, the room may be checked by
two different visitors. One schedule should be marked A, the
other B.

Scores of the same homes on two trials should be similar. If a group
of homes are scored twice there should be a high correlation between
the scores. Please report findings to F. Stuart Chapin, University of
Minnesota.


SCHEDULE OF LIVING ROOM EQUIPMENT

I. Fixed Features

1. Floor ..........
   Softwood 1, hardwood 2, composition 3, stone 4.
2. Floor covering ..........
   Composition 1, carpet 2, small rugs 3, large rug 4, Oriental rug 6.
3. Wall covering ..........
   Paper 1, calcimine 2, plain paint 3, decorative paint 4, [illegible].
4. Woodwork ..........
   Painted 1, varnished 2, stained 3, oiled 4.
5. Door protection ..........
   Screen 1, storm door 1.
6. Windows ..........
   1 each window.
7. Window protection¹ ..........
   Screen, blind, netting, storm sash, awning, shutter, 1 each.
8. Window covering¹ ..........
   Shades 1, curtains 2, drapes 3.
9. Fireplace ..........
   Imitation 1, gas 2, wood 4, coal 4.
10. Fire utensils ..........
   Andirons, screen, poker, tongs, shovel, brush, hod, basket, rack, 1 each.
11. Heat ..........
   Stove 1, hot air 2, steam 3, hot water 4.
12. Artificial light ..........
   Kerosene 1, gas 2, electric 3.
13. Artificial ventilation ..........
14. Clothes closets 1 ..........
Total section I ..........

II. Built-in Features

15. Book containers ..........
   Shelves 1, cases 2.
16. Beds ..........
   In a sideboard 1, in a ceiling 2, in a door 3.
18. Window seats 1 ..........
19. Window boxes 1 ..........
Total section II ..........

III. Standard Furniture

20. [illegible]
21. [Chairs] ..........
   Straight, rocker, arm-chair, high chair, 1 each.
22. Stool or bench ..........
   High stool, footstool, piano stool, piano bench, 1 each.
23. [Couch] ..........
   [illegible] 1, sanitary couch 2, chaise longue 3, daybed 4,
   davenport 5, bed-davenport 6.
24. [Desk] ..........
   Business 1, personal-social 2.
25. Bookcases 1 ..........
26. Wardrobe or movable cabinet 1 ..........
27. Sewing cabinet 1 ..........
28. Sewing machine ..........
   Hand power 1, foot power 2, electric 3.

Etc., etc.

1 If checked out of season, ascertain if used in season and so record.
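
The arithmetic called for in the directions, multiplying each article present by its weight and summing, may be sketched as follows. The weights are copied from four items of the schedule above; the home that is scored and the names used in the code are hypothetical.

```python
# Weights copied from four items of the schedule reproduced above.
WEIGHTS = {
    "floor": {"softwood": 1, "hardwood": 2, "composition": 3, "stone": 4},
    "floor covering": {"composition": 1, "carpet": 2, "small rugs": 3,
                       "large rug": 4, "Oriental rug": 6},
    "artificial light": {"kerosene": 1, "gas": 2, "electric": 3},
    "windows": {"window": 1},
}


def total_score(recorded):
    """Sum of (number present x weight) over every article recorded.

    `recorded` maps item -> {article: number present}.
    """
    total = 0
    for item, articles in recorded.items():
        for article, count in articles.items():
            total += count * WEIGHTS[item][article]
    return total


# Hypothetical home: hardwood floor, one large rug, electric light, three windows.
home = {
    "floor": {"hardwood": 1},
    "floor covering": {"large rug": 1},
    "artificial light": {"electric": 1},
    "windows": {"window": 3},
}
print(total_score(home))  # 2 + 4 + 3 + 3 = 12
```

Keeping the weights apart from the record of what was observed mirrors the direction that the visitor record first and apply the weights only afterward.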


7. The Scale.—The ideal measuring device is the scale. By
a scale is meant a sequence of interchangeable external units
numbered from zero, such as a straightedge marked off into feet
and inches. In sociology and psychology most attempts to
develop scales have started from ranks or ratings. One of the
simplest devices is the so-called graphic rating scale. The
following is an example.

0             25            50         75            100
Completely    Submissive    Average    Dominating    Completely
submissive                                           dominating

Fig. 1.—A simple graphic rating scale.

Each judge rates each subject on a separate scale by making a
mark on the scale where he thinks the subject falls. The 
distance of the mark from “Completely submissive” taken as 
zero is then measured in units of the spatial scale. The final 
rating of each subject is the average of the ratings given him 
by the several judges, provided there is a tendency toward 
agreement among them. The scale may become more objective,
however, if a subject is scored, say, “80 per cent dominating”
because he is observed to dominate (as tangibly defined) in 
80 per cent of his contacts. This assumes that one contact is 
equal to another for the purpose in hand; but weighting may be 
applied if needed. Evidently, this kind of scale cannot claim 
the precision of scales in the physical sciences; but it is capable 
of very useful results. 

If the ordinal¹ numbers derived from ranking are subjected to
arithmetical treatment, such as addition or the calculation of 
means, it is implicitly assumed that the ranked instances are 
equally spaced on a linear scale. Thus, if we rank cities in 
respect to the efficiency of their governments, beginning with 
the least efficient, so that city C is 1, city A is 2, city B is 3, 
etc., and if we then use these ordinals as cardinals in arithmetical 
calculations, we imply that the government of city A is twice as 
efficient as that of city C, that the government of city B is 1.5
times as efficient as city A, etc. This assumption is, of course,
inaccurate, but sometimes it is the best that can be done, or it is
good enough for a particular problem. The zero point on such a
scale is arbitrarily placed, usually coincident with or one unit
below the lowest rank.
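
The ratios implied by treating ranks as amounts can be checked directly. In the sketch below the ranks of cities C, A, and B are those of the text; the shift of the zero point by ten units and the function names are assumptions introduced only to show that such ratios depend on where the arbitrary zero is placed.

```python
# Ranks from the text: city C is 1, city A is 2, city B is 3.
ranks = {"C": 1, "A": 2, "B": 3}


def implied_ratio(ranks, first, second):
    """Ratio implied when ordinal ranks are treated as cardinal amounts."""
    return ranks[first] / ranks[second]


def shift_zero(ranks, offset):
    """Move the arbitrary zero point by adding `offset` to every rank."""
    return {city: rank + offset for city, rank in ranks.items()}


print(implied_ratio(ranks, "A", "C"))  # 2.0, "twice as efficient"
print(implied_ratio(ranks, "B", "A"))  # 1.5

# The order of the cities is unchanged, but the implied ratio is not.
shifted = shift_zero(ranks, 10)
print(implied_ratio(shifted, "A", "C"))  # 12 / 11, about 1.09
```

Differences between ranks are unaffected by the shift, whereas the ratios change with it, so statements such as "twice as efficient" rest entirely on the arbitrary placing of zero.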


The most elaborate effort to build an exact scale yet made
in the social sciences is probably that of L. L. Thurstone in the
case of his scale for the measurement of an attitude, a sample of
which is reproduced below.² Generalizing on Thurstone’s
method, and introducing minor modifications, it runs about as
follows. A considerable number of supposed indexes of the
attribute to be measured are chosen. Let us say that the
attribute is “radicalism”; then the indexes might include membership
in the Socialist party, admitted statements made against
[...] and books, subscriptions to this or that radical doctrine, atheism,
unconventional sexual behavior, and so on. After these indexes
have been selected and defined as objectively as possible, they
are submitted to a number of qualified judges, who are asked to
rank them in the order of the degree of radicalism that each
seems to imply. Indexes that appear to indicate about the same
degree of radicalism are regarded as ties. The indexes are thus
collected into successive piles, which to the judges should seem
to be equally spaced apart in degree of radicalism. When the
[...] the indexes, each index is assigned
the average rank given it by the several judges, except that any
index about which the judges differ too much is rejected entirely.

1 A cardinal number tells how many or how much; an ordinal number [...]

2 See L. L. Thurstone and E. J. Chave, The Measurement of Attitude,
University of Chicago Press, Chicago, 1929.


[Figure: Indexes C, A, and B marked along a horizontal scale running from 0 to 100.]

Fig. 2.—Diagram of a generalized Thurstone attitude scale.


Each index will then have an average rank or scale value, and 
these values may if desired be converted to a percentage scale, 
from the lowest value taken as zero to the highest value taken 
as 100 (see Fig. 2). The scale is then ready to be applied to 
other samples of instances (say persons), by simply checking 


Thurstone’s attitude scale has often given results that
correlated highly with those obtained by much simpler procedures,
such as graphic rating scales, and ratings¹ or rankings represented
by consecutive numbers. It has also been criticized on various
theoretical grounds.²
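
The generalized procedure described above (average the judges' ranks, reject any index on which the judges differ too much, and convert the remaining values to a percentage scale) may be sketched as follows. This is an illustration only: the ranks are invented, the index names are placeholders, and the rejection rule (a standard deviation of ranks above 1.0) is an assumption, since no numerical cutoff is stated here.

```python
from statistics import mean, pstdev

# Hypothetical ranks given by four judges to five indexes (1 = least radical).
judge_ranks = {
    "Index A": [1, 2, 1, 2],
    "Index B": [2, 1, 2, 1],
    "Index C": [4, 4, 3, 4],
    "Index D": [5, 5, 5, 5],
    "Index E": [1, 5, 3, 4],  # the judges disagree widely on this one
}

MAX_SPREAD = 1.0  # assumed cutoff for "differ too much"


def scale_values(rank_table, max_spread):
    """Average rank per index, dropping indexes with too much disagreement,
    then rescale so the lowest kept value is 0 and the highest is 100."""
    kept = {
        name: mean(ranks)
        for name, ranks in rank_table.items()
        if pstdev(ranks) <= max_spread
    }
    low, high = min(kept.values()), max(kept.values())
    return {name: 100 * (value - low) / (high - low) for name, value in kept.items()}


scale = scale_values(judge_ranks, MAX_SPREAD)
for name, value in sorted(scale.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.1f}")
```

With these invented ranks, Index E is rejected and the other four indexes receive values between 0 and 100. A different measure of disagreement among the judges could be substituted for the standard deviation without changing the outline of the method.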


¹ For example, individuals are classified as Very Radical, Radical, Neutral,
Conservative, Very Conservative, and those in the Very Radical group are
given a score of one, those in the Radical group a score of two, etc.

² See R. K. Merton, Fact and Factitiousness in Ethnic Opinionnaires,
American Sociological Review, Vol. 5, pp. 13-28, 1940.


SAMPLE OF A THURSTONE ATTITUDE SCALE¹


EXPERIMENTAL STUDY OF ATTITUDE TOWARD THE CHURCH 


Check (✓) every statement below that expresses your sentiment toward
the church. Interpret the statements in accordance with your own
experience with churches.

                                                                  Scale
                                                                  value

 1. I think the teaching of the church is altogether too superficial
    to have much social significance............................. [...]
 2. I feel the church services give me inspiration and help me to
    live up to my best during the following week................. [...]
 3. I think the church keeps business and politics up to a higher
    standard than they would otherwise tend to maintain.......... 2.6
 4. I find the services of the church both restful and inspiring.. 2.3
 5. When I go to church I enjoy a fine ritual service and good
    [...]......................................................... [...]
 6. I believe in what the church [...] reservation................ [...]
 7. I do not receive any benefit from attending [...]
    but I think it helps some people............................. [...]
 8. I believe in religion but I seldom go to church............... 5.4
 9. I am careless about religion and church relationships but I
    would not like to see my attitude become general............. [...]
10. I regard the church as a static, crystallized institution and as
    such it is unwholesome and detrimental to society and the
    individual.................................................... [...]
11. I believe church membership is [...].......................... 8.3 [?]
12. I do not understand the dogmas or creeds of the church but I
    find that the church helps me to be more honest and [...]..... [...]
13. The paternal and benevolent attitude [...].................... [...]
14. [...]
15. Sometimes I feel that the [...]
    and sometimes I doubt it...................................... [...]
16. I believe the church is fundamentally sound but some of its
    [...] have given [...]........................................ [...]
17. I think the church is a parasite on society................... [...]


¹ L. L. Thurstone and E. J. Chave, The Measurement of Attitude, p. 61,
University of Chicago Press, Chicago, 1929.

There are also several important methods of converting ranks
to a scale having more or less equal units and an arbitrary zero
point that are outside the scope of this text.¹ Probably the
most scientific is the mathematical method of curve fitting.²

When the concepts of space, time, money, weight, mass, and
so on, are used in sociology, they are of course amenable to
accurate measurement by scales already scientifically established.
8. Discrete Aggregates.—Population aggregates are of great 
importance in sociological studies. It is possible to define these 
aggregates (communities, neighborhoods, families, and the like) 
so that their number can be counted. We hold that it is also 
possible to measure the size of such aggregates by counting the 
number of individuals that compose them. We do this in the
belief that the only essentials of measurement are units that 
are equal and interchangeable for a purpose. The sociologist 
finds it more useful for his purposes to measure the size of the 
family in terms of the number of its members than in terms of 
their weight in pounds or their height in inches. The nature 
of a “member” does not vary from person to person in any way 
that interferes with the purpose. Moreover, since there is no 
point in subdividing a “member,” nothing is lost because it is 
logically a discrete unit. This idea of measurement can also be
extended to any other sociological concept that can be broken 
down into parts that are equal and interchangeable for the 
purpose in hand. 

9. The Measurement of an Intangible Quality.—All attempts
to measure an intangible quality, such as an attitude, must, of 
course, be indirect in type. The classic example of indirect 
measurement in the physical sciences is a thermometer that uses 
the changing length of a column of mercury as an index of change 
in the amount of the intangible quality “temperature.” In 
the case of the indirect measurement of a quality Y (temperature)
in terms of an index X (mercury column), there should ideally


¹ See J. P. Guilford, Psychometric Methods, McGraw-Hill Book Company,
Inc., New York, 1936; P. M. Symonds, Diagnosing Personality and Conduct,
pp. 86-89, D. Appleton-Century Company, Inc., New York, 1931.

² Karl J. Holzinger, Statistical Methods for Students in Education, pp.
221-224, Ginn and Company, Boston, 1928; C. H. Richardson, An
Introduction to Statistical Analysis, Chaps. VIII and X, Harcourt, Brace and
Company, Inc., New York, 1934.

be a perfect straight-line relationship between the two (see Chap.
X), so that each unit change in X represents a constant amount
of change in Y. But since Y is an intangible and cannot be
directly measured, there is no way of proving that such a
relationship exists between Y and X. So, while we may be certain that
a scale distance of say 4X is twice as great as a scale distance of
2X, we cannot be certain that a scale distance of 4X represents
twice as much of Y as does a scale distance of 2X. A child
with an I.Q. of 120 is probably not just twice as intelligent as
another child with an I.Q. of 60. All devices of indirect
measurement, including the thermometer, are open to this objection.
But the scientific and practical usefulness of the thermometer
and of other indirect measuring devices suggests that for many
purposes this is not serious. Usually the important things are
rather that the same absolute reading on the X scale shall always
represent the same amount of the intangible quality Y, as
verified by introspection or by some external result in which we
are interested (e.g., at 32°F., water freezes); and that the X scale
shall be able to differentiate changes in Y small enough for our
purposes. We shall then know what to expect from Y when


the scale registers a certain value of X. If the relationship is
close enough to permit a useful prediction of Y from the reading
on the X scale, the latter may still be valuable and is not to be
discarded until a better index is found.

In practical scale or score-card making, where there is an
attempt to measure an intangible quality Y in terms of a tangible
index X, it is often helpful to set up a “fundamental interval” for
subdivision. This is done by selecting two extreme observable
instances of Y, marking the values of X corresponding to them
“0” and, say, “100” respectively, and dividing the included
range of X into 100 equal units. In the case of one thermometer,
the extreme instances of temperature are taken at the melting
point of ice and at the condensing point of steam. As a parallel
in mental testing, for certain purposes we might regard inability
to pass the first grade in school as indicative of zero intelligence,
and ability to finish the university with honors as indicative of
100 per cent intelligence, and represent intermediate degrees of
intelligence by scores between 0 and 100. As with most
thermometers, for many purposes the zero point need not denote an
absolute zero, and the upper limit need not mean the ultimate
maximum amount of Y. It is important, however, to make sure
that the “fundamental interval” includes as large a range of
data as investigators will require.
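
The subdivision of a “fundamental interval” amounts to a simple linear conversion, which may be sketched as below. The sketch is illustrative only: the function names are hypothetical, and the mercury-column lengths of 40 and 240 millimeters are invented numbers, not data from any actual instrument.

```python
def make_scale(x_low, x_high, units=100):
    """Return a function that converts a reading X into a score, so that
    x_low maps to 0, x_high maps to `units`, with equal steps in between."""
    if x_high == x_low:
        raise ValueError("the two reference readings must differ")

    def score(x):
        return units * (x - x_low) / (x_high - x_low)

    return score


# Thermometer analogy: assumed column lengths (mm) at the two reference
# temperatures that bound the fundamental interval.
to_degrees = make_scale(x_low=40.0, x_high=240.0)

print(to_degrees(40.0))   # lower reference reading
print(to_degrees(140.0))  # reading halfway between the two references
print(to_degrees(20.0))   # a reading below the arbitrary zero point
```

Readings outside the two reference points are still converted; they simply fall below 0 or above 100, which is why the interval should be chosen wide enough for the data that investigators expect to meet.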

When the quality Y is subjective (e.g., happiness in marriage),
it has already been implied that there are two ways of testing the
amount of relationship between it and the tangible index X
(e.g., a score on the Burgess-Cottrell¹ scale for measuring
happiness in marriage): (1) by comparing the amount of Y indicated
by the X instrument with the subjective judgment of the subject
or of a competent observer (e.g., couples getting high scores
on the Burgess-Cottrell scale consider themselves happy); this
is appropriate if interest centers in the subjective quality as such;
and (2) by checking the readings of the X instrument against
certain tangible conditions that are ascribed to Y (e.g., low
happiness scores on the Burgess-Cottrell scale are followed by
divorce more often than are high scores). These are called
tests of validity. Validity is also established in part by
definition and agreement, e.g., the cooperative definition described in
Chap. IV. Chapin's living room scale, mentioned above, is
intended to measure the socioeconomic status of the homes to
which it is applied. The fact that the card has given higher
scores when applied to upper middle class homes than when
applied to middle class homes, determined independently, is
evidence of its validity. Its reliability was established when
different observers used it on the same homes with little variation
in results.

Evidently, the indirect measurement of a subjective quality 
must wait upon the discovery of a satisfactory tangible index, 
which is to be sought among the apparent results or causes of the 
subjective quality, among the results of common causes, or, 
from a different point of view, among the external aspects of 
the subjective concept. Thus, the expansion and contraction 
of the column of mercury in a thermometer are apparently the 
result of changes in temperature. 

Whether the measurement of an intangible quality by means of 
a tangible index or by means of introspective ratings converted to 
scale values is superior depends upon particular circumstances, 
and especially upon the direction of interest. If possible, both
should be carried through for purposes of validation.


¹ E. W. Burgess and Leonard S. Cottrell, Predicting Success or Failure
in Marriage, Prentice-Hall, Inc., New York, 1939.

10. Rules of Measurement.—We summarize below what
are probably the most useful rules of measurement in social 
research. 

1. The quality that it is desired to measure should be defined
verbally as clearly as possible in the beginning. But the
measurement of a quality is also a crucial part of its definition. In
fact, “what the scale measures” may later be regarded as
preferable to the verbal definition, as equivalent to it, or as not at
all equivalent to it, depending on the degree of validity
established for the scale and the usefulness of its results.

2. The purpose of the measurement should be stated or 
understood. 

3. The unit used should be appropriate to the purpose of the 
measurement. 

4. Units should be equivalent one to another (equal,
interchangeable) for the purpose in view; except that in the indirect
measurement of an intangible quality in terms of a tangible
index the equality of the intangible units is indeterminate, and
for many purposes is unimportant.

If the units of a scale are sufficiently equal for a purpose, it is
safe for that purpose to add or average them, to interchange
them, or to claim that, say, two units represent twice as much
of the quality as does one unit.

For the historian, one year is not equivalent to another; for 
the actuary constructing a life table, it is. 

5. The unit should be applied as exclusively as possible to 
the quality defined for measurement, in accordance with the 
purpose stated. 

That is, in measuring a man’s height in inches, we should not
include his shoes, nor should we measure him in a slouched
posture. So, in measuring “intelligence,” we should, if possible,
exclude inequalities of effort.

6. The unit should be applied to the entire range in which
the investigator is interested.

In applying an inch end-over-end, or an inch scale, to measure
the height of a man, no part of the total distance that is his
height should be skipped or measured in other than a single
straight line. When a Fahrenheit thermometer registers the
temperature, however, it reads above or below a fixed point
that is arbitrarily called zero. This is adequate for ordinary
purposes, because most of us are interested only in the range of
temperature included in the thermometer, and not in an extension 
of that range to a depth never observed in ordinary experience. 
But for some scientific work, the temperature needs to be 
measured from a true zero point, and a different scale is used. 

The ratio of two measurements holds only with reference to
the zero point from which they are made. If this is not an
absolute zero, that fact should not be forgotten when interpreting
the ratios.
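
A short computation illustrates how the ratio of two readings depends on the zero point. The sketch below compares two Fahrenheit readings with the same temperatures expressed on the absolute (Kelvin) scale; the readings of 40° and 80° are arbitrary values chosen for the example.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert a Fahrenheit reading to the absolute (Kelvin) scale."""
    return (degrees_f - 32.0) * 5.0 / 9.0 + 273.15


f_low, f_high = 40.0, 80.0  # arbitrary example readings

# Ratio taken from the arbitrary Fahrenheit zero.
ratio_fahrenheit = f_high / f_low

# Ratio of the same two temperatures measured from absolute zero.
ratio_absolute = fahrenheit_to_kelvin(f_high) / fahrenheit_to_kelvin(f_low)

print(round(ratio_fahrenheit, 2), round(ratio_absolute, 2))
```

On the Fahrenheit scale the second reading appears to be twice the first, while from absolute zero the two temperatures differ by less than one-tenth. Neither figure is wrong; each holds only for its own zero point.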

7. The size of the unit should be fine enough to detect the 
smallest differences that are of importance for the inquiry, but 
need be no finer. 

8. Final judgment of an instrument designed to measure an 
intangible quality should depend chiefly on tests of its validity 
and reliability. 

Summary.—We have seen that even “subjective” qualities 
are amenable to a great deal of statistical analysis through 
counting, classification, ranking, and rating. They cannot be 
exactly measured unless the form of their theoretical distribution 
is known a priori, or unless they are perfectly correlated with 
some objective index; and it is seldom or never possible to 
demonstrate completely either of these propositions. Neverthe- 
less, such qualities have already been measured in both the 
natural and the social sciences successfully enough to satisfy 
many important scientific and practical uses. Devices like 
the Binet test and like those used to score social attitudes, 
socioeconomic status, personality traits, and so on, are promising 
approaches to measurement in social research, and their rapid 
improvement and extension to cover many more sociological 
concepts are to be anticipated. Moreover, objective qualities 
in which sociology is interested not only can be counted, classified, 
and the like, but they can also either be measured by the scales 
already standardized by the physical sciences or they should 
offer no difficulties that are peculiar to the social sciences. 


Exercises 


1. Is anything more than clearness of definition necessary to render 
data amenable to statistical treatment? Illustrate. 

2. Can classification and counting alone form any basis for statistical
analysis? Illustrate.


3. What are the main points to watch in the use of classification? 
Illustrate. 


4. How does classification differ from rating? Illustrate.
5. Give an example of the kind and amount of ability a judge should 
have to qualify as a "rater." 
6. Name at least one method of converting ranks to scale values. 
7. Devise a simple graphic rating scale for the personality trait of 
“sociability.” 
8. Describe some scoring device used in sociology. What is your 
opinion of it as a measuring instrument? 
9. Distinguish between a scoring device and a scale in the strict 
mathematical sense. 
10. Distinguish between counting and measurement. 
11. Illustrate a sociological problem where counting is equivalent 
to measurement. 
12. Discuss the possibility and necessity of equal units in the measure- 
ment of an intangible quality. 
13. What is of chief importance in the indirect measurement of an
intangible quality?
14. What is meant by the validity of a measuring scale? By its
reliability? How can an instrument designed to measure an intangible
quality be validated? Illustrate.
15. Give an example of an intangible quality of interest to sociology, 
and describe briefly two ways in which it may be measured. 
16. What method of measurement would you apply to answer each 
of the following questions: 
a. Does divorce tend to increase with family income? 
b. Do the ablest people leave the farm for the city? 
c. How do 10 cities compare in respect to good government? 
17. What is the reason for taking a number of measurements of the
same thing and averaging them?

CHAPTER III
FACTOR CONTROL 


Among the social sciences the controlled experiment has been 
employed much less than in the natural sciences. As a rule, 
sociologists have either preferred or felt obliged to investigate 
social situations in all their original complexity and confusion. 
The methods for dealing with this kind of data attempt to
introduce control by means of classification in the case of
attributes (unmeasured traits, e.g., married, single) and by
mathematical devices in the case of variables (measured traits, e.g.,
age in years).

1. The Actuarial Method.—One of the most effective schemes
of classifying attributes is similar in general principle to that
employed by actuaries in determining insurance risks.¹ For
example, a large number of paroled criminals may be sorted into 
relatively homogeneous groups with respect to various criteria, 
such as number of previous arrests, prison record, age, type of 
offense committed, intelligence, and so on, and the rate of 
violation of parole determined for each group. After proper 
testing, these rates may then be used as estimates of the proba- 
bility of violation of other prisoners who fall in the established 
classifications. 

We begin with a specified group of items, say paroled
prisoners from the Joliet (Ill.) penitentiary on Jan. 1, 1941. The
simplest classification is a dichotomy, or separation of the A’s
from the Not A’s. Thus, our parolees may be divided into the
married and the not married. If we wish to test whether
marital status (trait A) is associated with success on parole
(trait B), we compare the proportion of successful parolees
(B’s) among the married parolees (A’s) with the proportion
among the not married parolees (not A’s). When there is no
association, i.e., the traits A and B are independent, the two

¹ For a more thorough development of this technique, see G. U. Yule and
M. G. Kendall, An Introduction to the Theory of Statistics, Chaps. I-V,
Charles Griffin & Company, Ltd., London, 1937.

proportions will be the same, except for chance errors. In
other words, if 80 per cent of the married parolees succeeded, 
but only 60 per cent of the not married parolees did so, we would 
conclude that marital status was favorable to success on parole. 

Suppose we believe that a good prison record (trait C) also 
makes for success on parole. We test it in the same way as we 
did marital status above, and confirm our belief. It may then 
be worth while to make a double classification of the parolees 
by marital status and by prison record, as shown in Table 1. 
From this table we note that the proportion of successful parolees
in the group as a whole is 360/500 = 0.72, among the married is
240/300 = 0.80, and among the married with a good prison record is
65/70 = 0.93, approximately. On the other hand, among the


TABLE 1.—CLASSIFICATION OF 500 PAROLEES BY MARITAL STATUS AND
PRISON RECORD, JOLIET, ILL., JAN. 1, 1941. (HYPOTHETICAL DATA)

              Parolees, married     Parolees, not married
Outcome       Record    Record      Record    Record        Total
              good      not good    good      not good

Success         65       175          25        95           360
Failure          5        55          10        70           140

Total           70       230          35       165           500

not married parolees with a not good prison record, the proportion
of successes is 95/165 = 0.58 nearly. Evidently, in future groups
of parolees chosen in the same way and exposed to the same
general conditions as were the 500 represented in Table 1, a
married man with a good prison record may be expected to have
a much better chance of succeeding than a man not married with
a prison record that is not good. More specifically, for every
man of the first type that failed, we should expect 6 of the second
type to fail, out of equal numbers placed on parole.
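
The proportions quoted from Table 1 can be recomputed from the cell counts. The sketch below is one way of arranging that arithmetic in Python; the data structure and function name are choices made for the illustration, and the counts are the hypothetical figures of Table 1.

```python
# Cell counts from Table 1 (hypothetical data):
# (marital status, prison record) -> (successes, failures)
table = {
    ("married", "good"): (65, 5),
    ("married", "not good"): (175, 55),
    ("not married", "good"): (25, 10),
    ("not married", "not good"): (95, 70),
}


def success_rate(cells):
    """Proportion of successes over a collection of (successes, failures) cells."""
    cells = list(cells)
    successes = sum(s for s, _ in cells)
    total = sum(s + f for s, f in cells)
    return successes / total


overall = success_rate(table.values())
married = success_rate(v for (status, _), v in table.items() if status == "married")
married_good = success_rate([table[("married", "good")]])
unmarried_not_good = success_rate([table[("not married", "not good")]])

# Failures expected in the least favorable group for every failure in the
# most favorable group, out of equal numbers placed on parole.
failure_ratio = (1 - unmarried_not_good) / (1 - married_good)

print(round(overall, 2), round(married, 2), round(married_good, 2),
      round(unmarried_not_good, 2), round(failure_ratio, 1))
```

The failure ratio comes to a little under 6, which agrees with the round figure of 6 failures of the second type for every failure of the first.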

It is, of course, possible to subclassify the cases in Table 1 
still further, either by substituting more complete breakdowns 
for the dichotomies (e.g., married, single, divorced, widowed for 
married, not married), or by introducing additional factors
(e.g., employment record before arrest).¹

¹ See Chap. XI.

2. The Search for Causes.—It is often said that the
underlying purpose of all science is prediction. Certainly, scientific
research constantly seeks to discover causes. Much philo- 
sophical dispute has occurred regarding the nature and reality 
of a cause, but we shall here say only that we mean by a cause 
any factor whose change under controlled conditions is invariably 
followed or accompanied by a change in a second factor. The 
logicians refer to this as concomitant variation. The kind of 
causes with which practical science is most concerned are simply 
factors that give the easiest and most reliable prediction, or 
understanding, of certain conditions that constitute a problem. 
Thus, if we can always change the divorce rate in a given type of 
social situation by changing the proportion of
Protestant-Catholic marriages, the intermarriage of Protestants and
Catholics may be regarded as one cause of divorce under the given
conditions. In the social sciences, there are always many
causes that combine to produce any actual situation or result.
Evidently the divorce rate of a city is the product of a vast
number of forces, only some of which can be discovered or
controlled.

3. Matching Experimental and Control Groups.—The logical
requirements for establishing a causal relationship are the same
in every science.¹ It is always necessary to establish the fact
of concomitant variation. For working purposes, the procedure
is essentially to introduce, remove, or vary in amount the
suspected cause, and then to observe or measure the
corresponding changes, if any, in the thing that is expected to be affected.
For example, suppose that we want to test the belief that
knowledge of the evils of alcohol will prevent young people from
drinking. We expose a number of such persons to appropriate
instruction and note what proportion of them acquire the habit
of drinking within, say, a two-year period. In this group, called
the experimental group, the supposed cause is present. A second
group of young people, which may be termed the control group,
is given no instruction, so that the supposed cause is absent.
After two years, the proportion of habitual drinkers is
determined in this group also, and the proportions are compared
between the experimental and control groups. If the


¹ See John Dewey, Logic: The Theory of Inquiry, [...],
Henry Holt and Company, Inc., New York, 1938.

experimental group shows a lower percentage of drinkers than the
control group, however, it still cannot be said that the instruction 
made the difference, unless it can also be shown that nothing 
else is likely to have done so. Thus, it is possible that the 
experimental group contained a considerably larger proportion 
of women or of church members than the control group, which 
might make the comparison unfair. It is evidently necessary in
any experiment that the experimental and control groups shall
be essentially alike in all important respects that might affect
the outcome, except for the factor or factors under investigation.
This must, of course, be taken care of when the experiment or
investigation is being planned. The young people in our
experimental group must have no characteristics, except the
instruction, that will make them more liable or less liable to
become drinkers than those in the control group. The usual
way of trying to insure this equality is to match the two groups
in respect to every important point that may be related to
drinking, such as age, sex, family background, church
membership, present drinking habits and attitudes, and so on. Moreover,
all conditions must remain approximately the same for the two 
groups during the two years that the experiment is under way. 

4. The Principle of Randomization.—In sociological research,
however, it is seldom that an investigator can feel that his
experimental and control groups are actually matched in all
important respects needed to insure a valid comparison between
them. He is, therefore, obliged to summon to his aid the
principle of randomization. Having matched his two groups as well as
he reasonably can, he then decides by a random draw which
of each pair of matched subjects, or which subjects from the
total lot, shall belong to the experimental group and which to
the control group. If this is not feasible, it may be decided by a
draw which of the two matched groups shall be the experimental
one, or this may be done in addition to the above. As long
as there are only two groups, this latter method of randomization
alone is not very effective. The experiment will be better
designed if there can be several groups, or replications, half of
which are drawn at random to serve as the experimental groups.
In some cases, indeed, the whole process of matching the groups
may best be omitted, and dependence placed in subdividing the
potential events (e.g., a large number of unselected young
people) into two or more groups by random selection. When
any good method of randomization is used, all initial differences 
between the experimental and control groups should be accidents 
of chance. 
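
Two of the procedures just described, the random draw within each matched pair and the random subdivision of an unmatched lot, may be sketched as follows. This is an illustrative sketch only: the function names and subject labels are hypothetical, and Python's standard pseudo-random generator stands in for a physical draw.

```python
import random


def assign_matched_pairs(pairs, seed=None):
    """For each matched pair, a random draw decides which member joins the
    experimental group; the other member joins the control group."""
    rng = random.Random(seed)
    experimental, control = [], []
    for first, second in pairs:
        if rng.random() < 0.5:
            first, second = second, first
        experimental.append(first)
        control.append(second)
    return experimental, control


def assign_unmatched(subjects, seed=None):
    """Split an unmatched lot into two groups of (nearly) equal size at random."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


pairs = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]  # hypothetical matched pairs
experimental, control = assign_matched_pairs(pairs, seed=1)

lot = [f"S{i}" for i in range(1, 11)]  # hypothetical unmatched subjects
group_one, group_two = assign_unmatched(lot, seed=1)
```

Fixing the seed makes a particular draw reproducible for checking; in an actual investigation the draw would be made once and recorded.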

5. Pretests and Final Tests.—Whatever method of
equalization is used, it is well before subjecting the groups to the
conditions of the experiment to test them to see how much alike the
experimental and control groups really are in pertinent respects.
This is usually done by means of a pretest, which is the same
as the final test that will be used at the end of the experiment to
measure the differences between the groups at that time. Thus,
in our illustration, we might set up a battery of questions about
drinking habits that would enable us to decide to what extent
a young person drank or was predisposed to drink, and if the
experimental and control groups scored about the same on this
test, we might regard them as equivalent for the purposes of our
investigation.

6. The Influence of Additional Factors.—It is often desirable 
to test the effects of a third factor on the relationship between the 
independent and dependent factors in an experiment. In this 
case, the third factor is inserted and removed, with only the 
independent and dependent factors present. Thus, we might 
observe the influence of sex in studying the influence of instruc- 
tion on drinking. Both control and experimental groups would 
then be divided by sex, giving four groups rather than two. 
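
The four-group comparison can be illustrated with a small tabulation. The sketch below uses eight invented records, far too few for a real study, merely to show the arithmetic; the record layout and function name are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: (group, sex, became a habitual drinker)
records = [
    ("experimental", "M", True), ("experimental", "M", False),
    ("experimental", "F", False), ("experimental", "F", False),
    ("control", "M", True), ("control", "M", True),
    ("control", "F", True), ("control", "F", False),
]


def drinker_rates(rows):
    """Proportion of habitual drinkers in each (group, sex) subgroup."""
    counts = defaultdict(lambda: [0, 0])  # (group, sex) -> [drinkers, total]
    for group, sex, drinker in rows:
        counts[(group, sex)][0] += int(drinker)
        counts[(group, sex)][1] += 1
    return {key: drinkers / total for key, (drinkers, total) in counts.items()}


rates = drinker_rates(records)

# Difference between control and experimental proportions, taken separately
# for each sex, so that sex cannot account for the difference.
effect_by_sex = {
    sex: rates[("control", sex)] - rates[("experimental", sex)]
    for sex in ("M", "F")
}
print(rates)
print(effect_by_sex)
```

If the difference between control and experimental groups were about the same for both sexes, sex would appear not to alter the relationship between instruction and drinking; a marked difference between the two figures would suggest that it does.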

7. The Case of Continuous Variables.—In the illustration
above, we were dealing with attributes, such as “instruction,”
“no instruction,” “habitual drinkers,” “not habitual drinkers,”
rather than with measured variables, like the amount of
instruction and the amount of the tendency to drink. Although there
is no difference in principle between the two cases, there is
some variation in procedure. Thus, if we wanted to measure
the amount of the tendency to drink in relation to the amount
of instruction given, we should take several groups instead of
only two. To each of the several groups we should give a
different amount of instruction, including no instruction at all


¹ For a more advanced discussion of this subject, together with the
statistical techniques of analysis of variance and covariance that have recently
been developed in connection with it, see E. F. Lindquist, Statistical Analysis
in Educational Research, Chaps. IV-VI, Houghton Mifflin Company,
Boston, 1940.

to one group, and note whether there was any relationship
between the increasing amount of instruction and the tendency 
to drink after two years. As before, we should have to equate 
the groups in all important respects before experimenting with 
them, or else be prepared to make corrections for the differences. 
Of course, we should have to devise scales for measuring the 
amount of instruction and the amount of the tendency to drink, 
before we could treat these factors as continuous variables. 
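
One simple way of noting such a relationship, offered here only as an illustration and not as the procedure prescribed above, is to fit a straight line to the group averages by least squares. The hours of instruction and the average "tendency to drink" scores below are invented, and the scale for the scores is assumed to exist already.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares straight line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    return sum_xy / sum_xx


# Hypothetical groups: hours of instruction given to each group, and the
# group's average score on an assumed "tendency to drink" scale two years later.
hours_of_instruction = [0, 5, 10, 15, 20]
average_tendency = [50, 46, 41, 37, 31]

slope = least_squares_slope(hours_of_instruction, average_tendency)
print(round(slope, 2))
```

A negative slope in such invented figures would indicate that the tendency to drink falls as the amount of instruction rises; whether the difference could be attributed to the instruction would still depend on the groups having been equated or randomized as described earlier.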

8. Interfering Variables.—As in the case of attributes above, 
it is usually important to measure the influence of certain inter- 
fering variables. In our drinking experiment, some of these 
might be the attitude of the parents toward drinking, the
subjects’ ages, their money incomes, and so on. Such variables
are not matched or randomized out of the experiment, but are 
introduced in varying known amounts, and their effects on the 
independent and dependent variables are measured. Factors 
may then be held constant, or their influence subtracted out, by 
mathematical methods. This type of analysis yields more 
information and information of a more practical kind than when 
all interfering factors are actually removed by matching or 
are equalized by randomization; and it is also generally easier to 
carry out. 


Exercises 


1. Illustrate the use of the actuarial technique in the prediction of
success in marriage.

2. Explain how you would obtain control over interfering factors in a
study designed to show the effects of the presence of children on the
divorce rate, or other problem of your choosing.

3. Comment briefly on the following published statements:

a. “Despite marked advances in appendicitis diagnosis and surgery,
Wisconsin's death rate from the ailment, which stood at 11.6 deaths
per 1,000 population in 1911, nevertheless increased to a rate of 18.2
in 1930.”²

b. “Women Are Safer Drivers than Men Records Reveal: When Mary
and Jack borrow Dad's car for a ride, they'll be smart if they let Mary
do the driving.

¹ See, for example, Mordecai Ezekiel, Methods of Correlation Analysis,
Chap. XIII, John Wiley & Sons, Inc., New York, 1930; or G. W. Snedecor,
Statistical Methods, rev. ed., Chaps. XII and XIII, Collegiate Press,
Inc., Ames, Iowa, 1938.

² [...] Health Bulletin, Madison, April-June, 1935,
p. 26.

“For in spite of the young man’s claim to being a better driver, state
highway commission records show that women drivers seldom are 
involved in fatal accidents. Young men, however, are involved in 
more fatal automobile crashes than any other age class of motorists. 

“Few women drivers are found on state highway commission fatality 
records, and only one person was killed in the last two years by a girl 
driver under 18 years of age. 

“State safety workers won't argue that Mary is a better driver than 
Jack, but they do claim that state records indicate she is a safer driver." 

c. “Homemaking Careers Attracting More Girls: In increasing number, 
girls are turning attention these days to homemaking as a career. 

“The popularity of homemaking courses is shown in the increasing 
enrollment in home economics at the University of Wisconsin where
enrollment this fall is nearly 10 per cent above 1936, according to the 
director of the course." 

d. “There has been more social progress in the United States in the 
last 18 years since women have had the vote." 

e. “The Distilled Spirits Institute, demanding that the Anti-Saloon 
League recognize the prevailing downward trend of major crimes, bases 
its case largely on this general statement: The total (of all crimes) for 
the calendar year 1936 showed a decrease of 112,055 offenses as com- 
pared with 1935.” 

(Turn in to the instructor two examples of the misuse of statistical
reasoning clipped from newspaper or magazine.)


CHAPTER IV


THE STATISTICAL INQUIRY 


1. The Role of Nonquantitative Methods.—Access to non- 
quantitative methods, such as the historical method, the case 
study, and the general interview, is not to be denied the statistical 
investigator in sociology. Many of his problems and ideas will 
be suggested by working with materials of these kinds before 
the statistical study is set up. Also, during the progress of the 
collection of the statistical data and analysis of them, he will 
usually find it invaluable to interview or talk with the informants 
and their neighbors, to saturate himself with their points of 
view and backgrounds, and to judge the reliability of their 
replies to formal schedule questions by shrewd observation. 
Finally, as suggested in Chap. I, in interpreting his statistical
findings, some important questions are almost certain to arise
that cannot be answered from the figures in hand, and he will
want to go back to the living situations for fresh suggestions.
The statistical investigator is expected, however, to limit his
formal conclusions to those arrived at by tested quantitative
methods.

2. The Problem.—The statistical problem in sociological 
research may vary from what is exploratory and merely fact
finding to the testing of a sharply stated hypothesis, depending
upon how much is already known about the subject. We may
set up a study to find out anything we can about divorce in the
United States, or we may limit the inquiry to testing the hypothesis
that the occupation of the husband plays an important
part in the situation. Exploratory or fact-finding studies
should be regarded as merely preliminary to more specific and
better controlled studies, because the former cannot penetrate
beneath the surface of social phenomena. The problem should
also be cut to fit the limitations of time, money, and personnel
qualifications at the disposal of the investigator. It should
usually be a problem of obvious theoretical or practical importance,
although a certain amount of research without apparent
value but of interest to the investigator should be encouraged, 
because this kind of probing about has sometimes resulted in 
important scientific discoveries. The availability or lack of 
availability of reliable statistical data is another consideration 
that will affect the choice of a problem. This bears on the 
point that the problem must be capable of quantification or 
measurement. Above all, the problem should lie in the field of 
methodological and informational competence of the investigator, 
but as far as possible outside his field of personal bias. There is 
sometimes a conflict here, as when a Negro sociologist wishes to 
investigate the social conditions of the Negro race. He should 
know the field better for being a Negro, but he is likely to carry 
into the study a racial sympathy that may influence his findings. 
It is very desirable for an investigator to state frankly his biases, 
as well as to do his best to overcome them. 

Of course, no problem should be finally selected until it is 
known to what extent and by what methods it has already been 
studied. Although some investigations need to be repeated or 
done differently for confirmation, it sometimes happens that 
a problem has been very satisfactorily solved, and further work 
on it would be a waste of time. What is more likely is that 
certain angles of the problem have been worked out, but other 
angles remain to be investigated. The research worker is, 
therefore, guided by a knowledge of previous work into the 
most profitable channels for further study, and may obtain 
suggestions and warnings from what others have done. 

1 Aids in locating previous sociological research on a topic include the files
of The American Journal of Sociology, The American Sociological Review,
The Journal of Social Forces, Sociology and Social Research, and Population
Index; Social Science Abstracts (1929-1932); P. K. Whelpton, Needed Population
Research, Science Press Printing Company, Lancaster, Pennsylvania,
1938; The Psychological Index; Encyclopedia of the Social Sciences, E. R. A.
Seligman, ed., The Macmillan Company, New York, 1930; Poole’s Index to
Periodical Literature; Readers’ Guide to Periodical Literature; Annual Magazine
Subject Index; Book Review Digest; United States Catalog: Books in Print;
Cumulative Book Index.

In dealing with a statistical problem of the more scientific
sort, it is indispensable to state the problem as a formal hypothesis
or hypotheses to be tested. Such a hypothesis should be so
worded that the task of the investigator is made as easy as
possible. It is usually simpler to use a positive hypothesis than
a negative one, and then to try to disprove rather than to prove it.
In most cases we can never prove a general affirmative
proposition because we cannot examine all possible cases; but
a single exception may effectively disprove it. Thus we might
set up the hypothesis, “Any association found between the birth
rate and the business index, with the marriage rate held constant,
is due to chance errors,” and seek to show that in our
sample it is not due to chance errors. We can only
disprove, or fail to disprove, such a hypothesis. For practical
purposes, however, we may regard as provisionally true any
hypothesis that careful tests have failed to disprove.
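
A hypothesis of the form "the association is due to chance errors" can be put to a simple test by shuffling one series many times and seeing how often chance alone produces an association as strong as the one observed. The sketch below uses a permutation test, a modern device substituted here for illustration and not a method described in the text. The yearly figures are invented, the function names are assumptions, and the marriage rate is not held constant.

```python
import random
from math import sqrt


def pearson_r(a, b):
    """Ordinary product-moment correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cross / sqrt(ss_a * ss_b)


def chance_p_value(x, y, trials=2000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    as_extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)          # break any real pairing of x and y
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            as_extreme += 1
    # Count the observed pairing itself, so the estimate is never zero.
    return (as_extreme + 1) / (trials + 1)


# Invented yearly figures, for illustration only.
business_index = [82, 85, 91, 96, 100, 104, 99, 93, 88, 95]
birth_rate = [18.1, 18.4, 18.9, 19.6, 19.9, 20.3, 19.8, 19.2, 18.6, 19.3]

print(chance_p_value(business_index, birth_rate))
```

A small value leads us to reject the chance hypothesis; a large one means only that we have failed to disprove it.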

3. Secondary Statistical Data.—Research is a cooperative
enterprise, and the social investigator often necessarily
uses data collected by someone else. The chief sources of
secondary statistical data that are of interest to sociologists
are the publications of the various bureaus and divisions of the
Federal, state, county, and municipal governments, and a few
private agencies.1


1 Federal agencies in the United States include the Bureau of
..., the Division of Rural Life and Welfare of the Department of
Agriculture, the Bureau of Agricultural Economics, the Bureau of Labor
Statistics, the Children’s Bureau, Public Health Service, Works Projects
Administration, Division of Vital Statistics of the Bureau of the Census,
... Committee, Interstate Commerce Commission, Central
Statistical Board, Department of Commerce, Department of the Interior,
Federal Bureau of Investigation, National Archives, National Youth
Administration, Tennessee Valley Authority, Women’s Bureau, United
States Employment Service, Immigration and Naturalization Service,
Agricultural Adjustment Administration, Farm Security Administration,
Office of Education in the Department of the Interior, Office of Indian
Affairs in the Department of the Interior. A current summary of Federal
agencies, their subdivisions and activities, is available in the United States
Government Manual issued by the National Emergency Council. A general
agency for the purchase of Federal documents is the Superintendent of Documents.
All these agencies are located in Washington, D.C.

Information about births, marriages, divorces, deaths, and the public
health is published by state bureaus of public health or vital statistics, with
offices in the state capitals. State bureaus of correction, departments of
education, departments of agriculture, departments of public welfare, planning
boards, tax commissions, and the like are important sources of data for
studying social conditions. State and private universities and agricultural
colleges also gather and interpret a great deal of information. The

Any serious statistical research project will, of course, soon lead
far beyond any general summary of sources of data. Much of 
the success of the trained investigator depends upon his ingenuity 
and persistence in discovering the available data that are per- 
tinent to his problem. Intimate familiarity with the field of 
investigation is the best aid here. 

After secondary data are found, however, the investigator 
must examine them carefully and critically before he can safely 
use them for his special purpose. He needs to know (1) the 
definition of the thing that is enumerated in relation to his 
purpose, or (2) the definition of the whole that is measured and 
of the unit by which it is measured, (3) the exhaustiveness and 
mutual exclusiveness of the classification, (4) changes in the 
definition, (5) the extent of actual over- or underenumeration or 
measurement, (6) the date or period in time to which the data 
apply. 

A few examples may be of help. In the 1935 Census of 
Agriculture in the United States, a farm was carefully defined as 


. . . all the land which is directly farmed by one person, either by his
own labor alone or with the assistance of members of his household, or
hired employees. A ranch, nursery, greenhouse, hatchery, feed lot,
or apiary is considered a farm. Establishments keeping furbearing 
animals or game, fish hatcheries, stockyards, parks, etc., are not con- 
sidered as farms unless combined with farm operations. 

The enumerator was instructed not to report as a farm any tract of 
land of less than 3 acres, unless its agricultural products in 1934 were 
valued at $250 or more. 


Brookings Institution of Washington, D. C., the National Bureau of Eco- 
nomic Research of New York, the Russell Sage Foundation of New York, 
the Scripps Foundation for Population Research of Oxford, Ohio, and the 
Gini Foundation of Palo Alto, Calif., are private organizations whose work 
is of value to social investigators. 

The latest copies of the Statistical Abstract of the United States, published 
by the United States Department of Commerce; the Abstract of the Census 
of the United States, published by the United States Bureau of the Census; 
and the World Almanac, obtainable at most newsstands, are of frequent use. 
Bibliographies include those of Dorothy C. Culver, Methodology of Social 
Research: A Bibliography, and of A. F. Kuhlman, Public Documents.

The League of Nations, the International Labor Office, and the Inter- 
national Institute of Agriculture publish much statistical material of world 
interest, available in public libraries.

A farm may consist of a single tract of land, or of a number of separate
tracts. These several tracts may be held under different tenures, as
when one tract is owned by the farmer and another tract is rented by
him. When a landowner has one or more tenants, croppers, or managers,
the land operated by each is considered a farm. Thus on a
plantation the land operated by each “cropper” or tenant was reported
as a separate farm. The land operated by the owner or manager, by
means of wage hands, was likewise reported as a separate farm.


That this definition of a “farm” nevertheless did not suit the 
purposes of all users of the census appears from comments like
the following:


The census uses a concept of a “farm” which is an arbitrary statistical
definition violating any sound reasoning from whatever standpoint
we may choose. In counting farm operators the census makes no distinction
between the sharecropper on the one hand, and, on the other
hand, the farmer who operates his property either personally or with the
aid of a manager and the tenant who operates a farm—strange as it may
seem, in current American agricultural statistics the plantation does
not exist. Paradoxically enough, it lives statistically under the disguise
of its direct competitor and adversary, the small family farm . . .
nobody knows how many plantations existed in the United States in
1920, 1925, 1930, or 1935.1


A great many more farms were enumerated by the Census of
Agriculture in 1935 than in 1930. Between these two censuses
no change was made in the definition of a farm; yet there is
evidence that the 1935 census counted as farms many plots
that were not counted as farms in 1930, especially in or near
mining and industrial areas. The depression and unemployment
caused the occupants of these plots to give more than ordinary
attention to gardening, chicken raising, and other home production,
and as a result these rural home places were lifted into the
farm class. Since the families and the plots were otherwise just
the same as they had been in 1930, and the “farmers” added by
the 1935 census were actually miners and industrial workers who
would return to their usual employment at the first opportunity,
it has been felt that the heavy increase in the number of farms
reported was largely spurious. As usual, however, the error,
if it may be so called, occurred on the periphery of the definition


1 Karl Brandt, Fallacious Census Terminology and Its Consequences in
Agriculture, Social Research, Vol. 5, pp. 19-37, 1938.


where the concept defined (a farm) shades off into something
different (not a farm). Most of the farms added in the above 
manner were quite small, and the value of their products was 
so close to the minimum of $250 that they might easily slip 
in and out of the farm category. The number of farmers 
returned by the census of agriculture is never the same as the 
number found by the accompanying census of occupations. 

In the case of farm laborers, including members of the farmer’s 
family working on the home farm, the problem of definition is so 
difficult that not much reliance can be placed in the figures 
furnished by the census. In addition, the census of 1920 was 
taken as of Jan. 1 and that of 1930 as of Apr. 1, and this shift 
of date alone caused a sharp variation in the number of farm 
laborers reported. It is well known that the census of popula- 
tion underenumerates young children, Negroes, and other 
classes that for one reason or another are likely to be overlooked; 
that the reporting of the population by years of age overloads 
the 5’s and 10's (e.g., 15, 20), at the expense of the other years 
(e.g., 14, 17, 19, 22); and so on.
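
The overloading of ages ending in 0 and 5 can be gauged with a simple ratio, in the spirit of what demographers call Whipple's index. The sketch below is an assumption-laden illustration: the function name, the age range of 23 to 62, and the scaling (1.0 for no heaping, 5.0 when every age is reported as a multiple of 5) are choices made here, and the reported ages are invented.

```python
def heaping_ratio(ages, low=23, high=62):
    """Ratio of ages ending in 0 or 5 to the one-fifth share expected by chance.

    The default range 23-62 spans 40 single years of age, of which exactly
    8 are multiples of 5, so evenly spread ages give a ratio of 1.0.
    """
    in_range = [age for age in ages if low <= age <= high]
    if not in_range:
        raise ValueError("no ages fall inside the chosen range")
    heaped = sum(1 for age in in_range if age % 5 == 0)
    return 5.0 * heaped / len(in_range)


# Invented reported ages, for illustration only.
reported = [25, 30, 30, 31, 35, 35, 38, 40, 40, 40, 42, 45, 50, 50, 57, 60]
print(round(heaping_ratio(reported), 2))
```

A ratio well above 1.0 suggests that informants are rounding their ages, and that single-year figures should be grouped before use.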

Such examples suggest only a few of the many pitfalls that 
lie in secondary data, even when collected by a great national 
agency like the Bureau of the Census, which may be regarded 
as unbiased and thoroughly honest in those aspects of its work 
that cannot be checked by the consumer of the data. The 
dangers are usually much greater in the case of data supplied 
by the smaller public agencies, like those of states or cities, and 
by many private agencies. The best rule is to insist, as far as 
possible, on knowing what was done by the collecting agency 
at each step of the data-gathering process, from definitions to 
field work to final tabulation; and on noting what checks they 
have applied to test the accuracy, reliability, and validity of
their data. Only when the investigator is reasonably satisfied
after a painstaking scrutiny of this kind that the data are appropriately
defined and sufficiently accurate for his purpose is he
justified in going forward with the work of analyzing and inter- 
preting them. Research workers have wasted months of effort 
and thousands of dollars before they discovered that the material 
on which they were basing their conclusions was hopelessly 
inaccurate to start with. Obviously, no amount of mathematical 
treatment can make amends for data of this kind.

4. Primary Statistical Data.—The usual method of gathering
firsthand data in sociological research is by means of the schedule
or of the questionnaire. Both are sets of questions to be answered
in blank spaces provided. The questionnaire is mailed out to
informants and is not often to be recommended. Not only
are the persons addressed likely to misunderstand or interpret in
diverse ways the questions asked, but they seldom answer all
of the questions, and many of them make no returns at all,
thereby tending to produce a biased sample. A much sounder
plan is to have trained interviewers with a schedule visit the
persons who are to give the information,1 or transfer the data to
the schedule from available records. The procedure properly 
begins with the formulation of the problem, and ends with the 
analysis of the data, because one step logically determines 
another, and a given investigation should be developed as an 
organic whole. 

5. The Schedule.—After the problem of fact finding or
hypothesis testing and the general approach to it have been
tentatively determined, the next step is normally to prepare
the schedule. The schedule is nothing more than a list of the
questions which it seems necessary to answer in order to test
the hypothesis or hypotheses, or to get the facts at which the
investigation is aimed. Much skill and labor are required to
include all the essential questions and nothing more. Anything
that is obvious or beside the point should be omitted. In
addition, each question must be simple and clear, and must be
answerable in terms of countable or counted units; and the
same question should have approximately the same meaning for
each informant. The units must be capable of objective definition,
so that there will be no serious amount of disagreement
about specific instances. Birth rates, an index of business conditions,
marriage rates, age in years or months, I.Q.’s, “male,”
“female,” “yes,” “no,” dollars, number of persons in family,
occupation, and so on, are acceptable units when carefully
defined in context. So much difficulty has been experienced with
a term like “occupation,” however, that the census bureau has
prepared a large manual with a detailed list of almost every
conceivable occupation, showing its schematic relationship to
more inclusive occupational categories.

1 It is possible to mail out questionnaires to carefully stratified classes of
the population, and to correct the replies in the light of answers obtained by
personal visitation of much smaller samples from each stratum.


THE ENUMERATIVE CHECK SCHEDULE 


SPECIMEN FORM EC-1 AND INSTRUCTIONS 


Printed below and slightly reduced from its actual size, is a specimen copy of EC-1 with entries made to illustrate
typical situations. As enumerated, these include: a fully employed head of family, a housewife, a part-time
worker, a worker temporarily absent, an unemployed worker, a new worker, a full-time student, a retired invalid,
and a worker on a special Government or emergency project. The specimen EC-1 Form as set out is followed by a
description of the manner in which an enumerator might receive the answers which are recorded on it. The
instructions printed on the back of EC-1 are reproduced on the opposite page.


[Specimen Form EC-1, NATIONAL UNEMPLOYMENT CENSUS (CONFIDENTIAL):
Enumeration of Persons Residing on Selected Postal Routes. The form carries
the household items (location; “Does this household live on a farm?”; total
number of persons in this household; number less than 14 years of age) and
a numbered line of columns for each person 14 years of age and over.]

Thomas E. Brown, the enumerator, begins his work on Monday, November 29, 1937. He has been instructed by
the postmaster and furnished with a package of EC-1 Forms preaddressed for each dwelling on the route to which
he is assigned, as well as a supply of EC-2 notices. The first EC-1 Form bears the address 2102 North Lake Street.
It is not a farm, so he writes “No” in answer to “B.” Mrs. Johnson answers the bell, and when Mr. Brown has introduced
himself, explaining the purpose of his call, she gives him the following information about the members of her
household.

There are 11 in all, including 2 not yet 14 years old. Mr. Brown writes in “11” and “2” for “C” and “D,” respectively,
and proceeds to list the names of the 9 grown-ups, and then to fill in the answers for each as Mrs. Johnson
responds to the questions.

The head of the house is Philip Johnson, age 56. He was fully employed during the week of November 14-20 at
a regular job. She is his wife, has always kept house for the family and does not want work for pay. Their oldest
son, George, is 32. He has a regular job but was put on a part-time basis in September and worked only 10 hours during
the week of November 14-20. Yes, he wants more work. Helen, who is 28, has a job. She was out sick the week of
November 14-20, but has since returned to work. Arthur, age 24, worked for several years up to last summer when
he was laid off. He wanted a job during the week of November 14-20 and has been temporarily away from home in
another city trying to find one. Peter, age 20, has not worked before but he, too, is looking for a job. Mary, age 17,
is still in school. Paul Smith, age 80, is Mrs. Johnson's father who lives with her. He gave up working several years
ago when his health made it impossible for him to carry on. Robert Jones, 24, is a roomer who is a laborer on a WPA
project.

Mrs. Johnson also says that another family, the Smiths, live upstairs in the house. As he does not have an addressed
EC-1 Form for the Smiths, Mr. Brown fills in a blank form for them and proceeds to his interview with Mrs. Smith.

A good rule is that each question in the schedule should be
answerable either in terms of some standard unit like dollars and
number of members in family (as defined), or in terms of a check
mark, code number, or letter that refers to a specific list. For
example, after the interviewer learns the subject’s occupation
he may enter in the schedule the code number of the appropriate
classification in the census manual of occupations. Open
questions, in answer to which any word or phrase may be inserted,
should be avoided. Thus the question, “To what social organizations
does he belong?” is usually less desirable than a comprehensive
list of social organizations to be checked, including
the catchall “Others,” to cover any institutions that may have
been omitted from the list. The ability of informants or
records to furnish sufficiently accurate answers should be considered.
Questions that call for more information than is
likely to be available, that rely too much on memory or on
memory of the distant past, that cause fatigue, or that excite
bias or involve personal interests, either are to be avoided or
special provisions are to be made to estimate, overcome, or
correct for the resulting errors. Questions addressed to an
informant should also be inspected to see if they suggest their
own answers (e.g., “Do you dislike to go to school?”). The
schedule should not modify the behavior it is intended to measure.
Special care should be taken that the schedule is not so
long as to weary or disgust the informants. If it has to be long,
more than one interview should be allowed, and the informant
should be paid or otherwise made to feel that the time given to
it is worth his while.

On page 38 is the enumerative check schedule used as a part of
the National Unemployment Census of 1937. Its purpose was not
to test any hypothesis, but merely to check the number of unemployed
persons enumerated by the voluntary registration plan.
It meets all the requirements mentioned above, except that it
employs a number of questions such as “Does he usually work for
pay?” the clarity and meaning of which are not obvious.


THE ENUMERATIVE CHECK SCHEDULE
INSTRUCTIONS


Household Information


A. Location—Give address fully, including apartment number, floor
number, rear, alley, etc., if necessary to identify the household.

B. Does this household live on a farm?—Consider as a farm any tract of
land locally so regarded.

C. Total number of persons in this household.—Include all persons living in
the same household unit, including servants and lodgers, also children and 
others temporarily away from this household. 

D. Number less than 14 years of age.—Enter total number of persons in 
this household who are less than 14 years of age. 


Questions About Each Person 


Name.—Before making the entries in any other column, list the names of 
all persons 14 years of age and over, then check with items “C” and “D,”
above, to account for every person in the household. 

Write each name on a numbered line; never crowd additional names 
between lines or at bottom of form. For households with more than ten 
members 14 years of age and over, continue the listing on a second form, 
repeating the address. 

Column 1. Sex.—Enter “M” for male and “F” for female.

Column 2. Color or race.—Enter “W” for white, “Neg” for Negro, and
“O” for other. Enter persons of Mexican parentage as “white” (W).
The “other” (O) group includes Indians, Chinese, etc.

Column 3. Age at last birthday.—If the exact age is not known, enter the
approximate age. 

Column 4. Was this person working for pay (or profit) during the week of 
November 14 to 20, 1937?—Enter “Yes” for each person who worked for pay 
(salary, wages, fees, commission, supplies, living quarters, etc.) or who 
worked for profit (in his own business, store, or on his own farm) at any time 
during the week of November 14-20. Enter “Yes” for each part-time 
worker, even though he worked only a few hours each day, or only a few
days of that week.

Enter “No” for each person who was NOT working for pay or profit, as
defined above, at any time during that week. In addition to persons who
were totally unemployed, “No” should be entered for the following classes
of persons: 

a. Housewives and other unpaid persons engaged only in housework or 
helping without pay in a family business or store or on the family farm. 

b. Sons, daughters, or other relatives who, without pay, help some mem- 
ber of the household in his work for pay or profit. 

c. Full-time students, and retired or disabled persons. 

d. Persons who had jobs but who were temporarily absent from work
during the entire week because of sickness, strike, vacation, or other similar
reasons.
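
The Column 4 instruction reduces to one test: did the person work for pay or profit at any time during the census week? The sketch below applies that test to some members of the specimen household. It is a hypothetical illustration, not part of the census procedure: the class and function names are assumptions, the hours are invented, and the roomer on the WPA project is left out because the instructions quoted here do not say how emergency project work is entered.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    hours_for_pay_or_profit: float  # hours worked for pay or profit, Nov. 14-20


def column_4(person):
    """Return the Column 4 entry: "Yes" if any paid work was done that week."""
    return "Yes" if person.hours_for_pay_or_profit > 0 else "No"


# Hours are invented; only "some" versus "none" matters for Column 4.
household = [
    Person("Philip Johnson", 40),  # fully employed
    Person("Mrs. Johnson", 0),     # housewife, unpaid housework only
    Person("George", 10),          # part-time worker still counts as "Yes"
    Person("Helen", 0),            # has a job but was absent the entire week
    Person("Arthur", 0),           # laid off and seeking work
    Person("Mary", 0),             # full-time student
]

for member in household:
    print(member.name, column_4(member))
```

The design point is that status labels such as housewife, student, or temporarily absent never enter the rule directly; they matter only because such persons did no work for pay or profit during the week.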


6. The Instructions.—To deal adequately with the definition 
of the terms used in a schedule, it is customary to accompany 
the schedule with a set of instructions, like those that follow 
the check schedule of the National Unemployment Census on 
page 39. A reading of these instructions will give an idea of 
the extent to which they may improve the accuracy of the
returns. In work of this kind there is, of course, always a
practical limit beyond which the matter of definition cannot be 
carried. 

7. The Tables.—It is usually impossible to set up a schedule 
with much confidence unless tables to receive the returns are 
made up at the same time. Just what summary statistics are 
wanted should be listed (e.g., means, proportions, correlation 
coefficients), and the tables needed to compute and exhibit 
them drawn up, together with a transcription sheet or cards to 
which all of the data will be transferred from the schedules. 
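
Drawing up the table before the field work begins can be imitated by building an empty table shell first and only then tallying the transcribed records into it. The sketch below is a minimal illustration under stated assumptions: the hour classes are loosely modeled on the stubs of Table 2, the record layout (sex, hours) is hypothetical, and the records themselves are invented.

```python
# Upper limit of each hours class and its row label (assumed classes).
UPPER_BOUNDS = [
    (8, "1-8 hours"),
    (16, "9-16 hours"),
    (24, "17-24 hours"),
    (32, "25-32 hours"),
    (40, "33-40 hours"),
]
ROW_LABELS = ["None"] + [label for _, label in UPPER_BOUNDS] + ["41 hours or more"]
SEXES = ["M", "F"]


def hour_class(hours):
    """Assign a number of hours worked to its row of the table."""
    if hours <= 0:
        return "None"
    for upper, label in UPPER_BOUNDS:
        if hours <= upper:
            return label
    return "41 hours or more"


def tabulate(records):
    """Build the empty shell first, then tally each (sex, hours) record."""
    table = {row: {sex: 0 for sex in SEXES} for row in ROW_LABELS}
    for sex, hours in records:
        table[hour_class(hours)][sex] += 1
    return table


# Invented transcription records: (sex, hours worked in the week).
records = [("M", 10), ("F", 0), ("F", 45), ("M", 8), ("M", 36), ("F", 12)]
for row, cells in tabulate(records).items():
    print(f"{row:<18}{cells['M']:>4}{cells['F']:>4}")
```

Because every row and column exists before any record is counted, a class that the schedule cannot fill shows up at once as a row of zeros, which is the kind of defect the table is meant to expose.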

Three of the many tables that were used in connection with the 
enumerative check schedule of the National Unemployment 
Census are shown below. 


TABLE 2.—PERSONS ENUMERATED IN CHECK AREAS AS PARTLY
UNEMPLOYED OR AS PART-TIME WORKERS, BY SEX AND HOURS
WORKED DURING THE WEEK OF NOV. 14-20, 1937*

(Data for persons 15-74 years of age)

[Table body not recoverable. Column heads: partly unemployed (total, male,
female) and part-time workers (total, male, female). Row stubs: total;
reporting, by hours worked (none, 1-8, 9-16, 17-24, 25-32, 33-40, 41 or more);
not reporting; per cent reporting.]

* From Dedrick and Hansen, Final Report on Total and Partial Unemployment, 1937,
Vol. IV, The Enumerative Check Census, Census of Partial Employment, Unemployment,
and Occupations, United States Government Printing Office, Washington, 1938.


TABLE 3.—PERSONS ENUMERATED IN CHECK AREAS AS NOT AVAILABLE
FOR EMPLOYMENT, BY SEX, USUAL WORK STATUS, DESIRE FOR
WORK, AND ABILITY TO WORK*

(Data for persons 15-74 years of age. Percentage not shown where less
than 0.1)

                                              Both sexes                Male                   Female
                                           Number      Per cent    Number      Per cent    Number      Per cent
                                                  of population           of population           of population
Total not available for employment        608,460          41.5   102,991          14.2   505,469          68.3
  Wanting but not actively seeking work    21,108           1.4     9,222           1.3    11,886           1.6
    Usually work                           14,082           1.0     7,491           1.0     6,591           0.9
    Do not usually work                     7,026           0.5     1,731           0.2     5,295           0.7
  Wanting but unable to work                3,471           0.2     2,264           0.3     1,207           0.2
    Usually work                            2,668           0.2     1,868           0.3       800           0.1
    Do not usually work                       803                     396                     407
  Not wanting and do not usually work     583,881          39.8    91,505          12.6   492,376          66.5


* From Dedrick and Hansen, Final Report on Total and Partial Unemployment, 1937,
Vol. IV, p. 33, The Enumerative Check Census, Census of Partial Employment, Unemployment,
and Occupations, United States Government Printing Office, Washington, 1938.
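
The counts in Table 3 lend themselves to a simple check that any user of published figures can make: do the parts add to their totals, and do the male and female counts add to the figure for both sexes? The sketch below applies that check to the counts as printed. The dictionary keys and function names are assumptions chosen for the illustration.

```python
# Counts from Table 3: (both sexes, male, female).
TABLE_3 = {
    "total": (608460, 102991, 505469),
    "wanting_not_seeking": (21108, 9222, 11886),
    "wanting_not_seeking/usually_work": (14082, 7491, 6591),
    "wanting_not_seeking/not_usually_work": (7026, 1731, 5295),
    "wanting_unable": (3471, 2264, 1207),
    "wanting_unable/usually_work": (2668, 1868, 800),
    "wanting_unable/not_usually_work": (803, 396, 407),
    "not_wanting": (583881, 91505, 492376),
}


def parts_add_up(total_key, part_keys):
    """True if the parts sum to the total in every column."""
    sums = tuple(sum(TABLE_3[key][i] for key in part_keys) for i in range(3))
    return sums == TABLE_3[total_key]


def sexes_add_up(key):
    """True if male plus female equals the figure for both sexes."""
    both, male, female = TABLE_3[key]
    return both == male + female


checks = [
    parts_add_up("total", ["wanting_not_seeking", "wanting_unable", "not_wanting"]),
    parts_add_up(
        "wanting_not_seeking",
        ["wanting_not_seeking/usually_work", "wanting_not_seeking/not_usually_work"],
    ),
    parts_add_up(
        "wanting_unable",
        ["wanting_unable/usually_work", "wanting_unable/not_usually_work"],
    ),
    all(sexes_add_up(key) for key in TABLE_3),
]
print(all(checks))
```

Agreement of this kind shows only that the table is internally consistent; it says nothing about whether the enumeration itself was accurate.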


As a result of constructing specific tables, the original schedule 
is likely to be considerably amended and improved, especially 
if a complete set of tables is made covering every important step 
in the treatment to which the data are to be subjected, including 
all work tables for the statistical analysis. 

8. Testing the Schedule.—After a schedule has been tenta- 
tively constructed, it should be tested for accuracy, reliability, 
and, if necessary, for validity. This applies to each question
separately and to the schedule as a whole. 

Accuracy may be checked by applying the schedule to known
data, and noting how closely the returns agree with the a priori 
information. The interviewer employed should have no prior 
knowledge of the data, and should not be aware that a check is 
being made. It is also sometimes possible to include in the 
schedule pairs of questions that get the same information in 
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independent ways; but this is usually confined to a few of the 
most important but least reliable questions. 


TABLE 4.—GAINFUL WORKERS, 1930, AND PERSONS EMPLOYED OR AVAILABLE
FOR EMPLOYMENT IN ENUMERATIVE CHECK AREAS, 1937, BY SEX
AND RACE AS PERCENTAGE OF POPULATION*

(Data for persons 15-74 years of age)

Both Sexes Male Female

[Body of table not recoverable.]

* From Dedrick and Hansen, Final Report on Total and Partial Unemployment, 1937,
Vol. IV, p. 35, The Enumerative Check Census, Census of Partial Employment,
Unemployment, and Occupations, United States Government Printing Office, Washington, 1938.

† Data derived from Fifteenth Census of the United States, Population, Vol. V, p. 117.

Reliability is measured by trying the schedule twice on essen- 
tially the same data and comparing the results. It is often 
impractical to apply the schedule more than once to the same 
informant without introducing the memory factor or causing 
an undesirable response. Probably the best that can then be 
done is to apply the schedule to two random samples from the 
same universe of informants, and compare the returns. The 
same interviewer or interviewers should be used in each case. 
In all such tests, the differences observed should fall well within 
the range of random sampling error.¹
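
The comparison of two random samples described above can be sketched in a few lines. This is only an illustration under stated assumptions: the counts are hypothetical, and the large-sample standard error of a difference between two proportions is assumed as the yardstick for "random sampling error" (the text itself defers the formal treatment to Chap. XII).

```python
import math

def difference_in_standard_errors(yes_a, n_a, yes_b, n_b):
    """Return (difference, standard error, difference / standard error)
    for the proportions answering "yes" in two independent random samples."""
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    # Pooled proportion, on the assumption that both samples were drawn
    # from the same universe of informants.
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    diff = p_a - p_b
    return diff, se, diff / se

# Hypothetical returns for one question from two trial samples of 200.
diff, se, ratio = difference_in_standard_errors(yes_a=118, n_a=200, yes_b=110, n_b=200)
print(f"difference = {diff:.3f}, standard error = {se:.3f}, ratio = {ratio:.2f}")
```

A difference smaller than one standard error, as in this made-up case, would ordinarily be read as falling well within the range of random sampling error; how large a ratio to tolerate is a judgment the sketch does not settle.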

A schedule, a part of the schedule, or one or more questions
in the schedule, need to be tested for validity when it is not clear
that they measure what is intended to be measured. This is
invariably the case when broad concepts are involved. For
example, if a schedule is designed to discover the number of the
“unemployed” in the United States as of a certain date, it is
advisable to give careful consideration to the matter of validity.

¹ See Chap. XII.

Whenever a recognized and proved scale for the same purpose
already exists, all that is required is to find the amount of agreement
between the returns from the two instruments, as used
on the same data. As a rule, however, this convenient situation 
does not occur: there is no true criterion by which to test the new 
instrument. 

In many cases the proper approach is simply that of finding 
an acceptable definition. With the help of anticipated users of 
the research, the investigator defines (1) what “area” of meaning 
of a term (e.g., the “unemployed”¹) should ideally be measured,
(2) what parts of this area it is practicable to measure reliably 
enough for the purposes of the inquiry, and (3) what parts it is 
not feasible to measure. The meaning that should ideally be 
measured is the meaning that it is wanted to measure. The 
investigator then tries to find objective and reliable indexes, 
which, by agreement, cover as much of the desired meaning as 
possible. The remaining part that is not covered should then be 
clearly recognized by the investigator and his public, and both 
should regard the omission as not serious enough to invalidate 
the study. Of course, the public may sometimes be the investi- 
gator’s scientific colleagues, sometimes social welfare agencies, 
and sometimes the general public. Or the investigator may 
merely interpret the interests of the public as he thinks best. 

It will frequently happen that the persons representing the 
consumers of the research will differ in what they want measured. 
In such a case, the choices are (1) to try to include all the desired 
parts of the meaning in a single index, (2) to use separate indexes 
for different parts of the meaning, or (3) to omit some parts of the 
meaning, and thereby reduce the number of people who will be 
satisfied with the results. 

One advantage of the method of setting up an inclusive or 
ideal definition of the area of meaning to be measured and then 
marking out how much of it the given instrument can reasonably 
be expected to measure is the fact that it may be possible in 
later studies gradually to expand the area measured until con- 
sumers finally agree either that the result is a satisfactory index 
of the total meaning, or that the part omitted is so intangible
and so little agreed upon that it can be disregarded. It is also
better to know approximately what the instrument does and
does not measure, i.e., how useful it is for its purpose, than to
say merely that “it measures what it measures.”

¹ The Census of Partial Employment, Unemployment, and Occupations: 1937,
whose schedule is shown above, included persons totally unemployed and
wanting work, emergency workers on WPA, NYA, CCC, etc., and persons
partly employed.

If the method of cooperative definition has not been used, or 
if its results are not entirely satisfactory, the question still
remains whether the index measures what it is wanted to measure.
Even in the more objective and simple instances, this is not
always certain. Thus, if we are trying with Thorndike to
measure the desirability of cities as places of residence,¹ and
include urban death rates as an objective index covering one
aspect of the concept of desirability, we shall need to ask if the
rates have been standardized for differences in the age and sex
composition of the city populations, if the out-of-town deaths
occurring in local hospitals have been omitted, and so on, before
we can be sure that the rates reflect differences in the incidence 
of fatal diseases and accidents between cities. In cases like this, 
the validity may be taken as established when our several 
questions are properly answered. But in dealing with less 
objective traits, this may not be enough. Suppose we include 
an attempt to measure the subjective trait of “friendliness” as a 
further element in the desirability of cities as places of residence. 
By the method of the cooperative definition outlined above, 
we may arrive at a combination of the average number of social 
visits and the percentage of the population belonging to social
organizations as a tangible index of this subjective quality. A
potential consumer of the investigation who has been consulted,
however, may say that he has lived in several cities and found
the people in some much “colder” to newcomers than in others,
and he doubts that the index will show this difference. If the
consumer from personal experience can classify certain cities as
“colder to newcomers” than others, we can apply our index of
friendliness and see where it places them. If the results are in
agreement with his observation, he is likely to accept the index.
Of course, in such cases, the experience or opinion of a single
individual is not enough. We should actually need to have many
persons, representative of our public, rate or score a series of
cities in regard to “coldness to strangers,” and compare their
ratings or scores with the results of our index of friendliness.
In doing this, we should be careful to choose as raters individuals
who are well acquainted through actual residence with at least
some of the cities in question.

¹ E. L. Thorndike, Your City, Harcourt, Brace and Company, Inc.,
New York, 1939.

Moreover, the ratings when repeated by the same or like 
groups should give essentially the same results. If the several 
raters show little agreement among themselves, as may happen, 
no criterion at all will result from this procedure. In that case, 
we may need to face the problem of the average. Probably a 
certain city was actually “cold” in its treatment of some of the 
raters and not of others. We might then have to devise a 
reliable score that would reflect the proportion of the raters who 
regarded the city as “cold,” or the amount of “coldness” that
they experienced there on the average, and relate it to our index.
Or we might feel it advisable to stratify our raters by socioeconomic
classes (e.g., rich, average, poor), and get separate
ratings from each class. The latter plan would require us to 
deal with the whole problem of the desirability of cities as places 
of residence from the point of view of each social class separately, 
which should provide a set of indexes of more value than any 
single index representing a gross average for all classes. In 
addition to subjective ratings, we might also set up, preferably 
by agreement, certain objective criteria of friendly or unfriendly 
cities, such as their methods of dealing with unfortunates, that 
are not included in our index, and test the latter against them. 
The final test of such an index, of course, is whether in practice
it proves more useful than other methods in selecting cities that 
people will actually find desirable or undesirable places of 
residence, in accordance with the prediction of the score card. 
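
The idea of relating the raters' judgments to the index can be sketched as follows. Everything in the example is hypothetical: the city names, the index scores, and the counts of raters. The use of a rank correlation as the measure of agreement is likewise an assumption made for illustration, not a procedure prescribed by the text.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def rank_correlation(x, y):
    """Rank correlation: the product-moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical index of friendliness for five cities.
friendliness_index = {"City A": 72, "City B": 55, "City C": 64, "City D": 40, "City E": 58}
# Hypothetical ratings: (raters calling the city "cold", raters in all).
cold_votes = {"City A": (3, 30), "City B": (14, 30), "City C": (8, 30),
              "City D": (21, 30), "City E": (6, 30)}

cities = sorted(friendliness_index)
index_scores = [friendliness_index[c] for c in cities]
proportion_cold = [cold_votes[c][0] / cold_votes[c][1] for c in cities]

rho = rank_correlation(index_scores, proportion_cold)
print(f"rank correlation between index and proportion 'cold' = {rho:.2f}")
```

A strongly negative coefficient would suggest that cities scoring high on the index are seldom rated "cold," which is the kind of agreement the raters' criterion is meant to test. With so few cities, of course, such a figure could be no more than suggestive.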

Ingenious ideas can often be used in testing the validity of 
an index. For example, if we are measuring attitude toward 
religion, we might see if our scale will place a group of ministers 
at the favorable end, a group of atheists at the unfavorable end, 
and average citizens for the most part in the middle. In work 
of this sort there are, however, many pitfalls that can be learned 
only from experience. 

Such tests of accuracy, reliability, and validity as mentioned 
above imply that the schedule will be tried out in the field on a 
small scale and carefully revised in the light of the results before 
it, the instructions, and the tables are put in final form. This 
preliminary trial almost invariably leads to some important
changes, and should rarely be omitted from the routine of 
statistical research. 

9. The Interviewer.—After the schedule has been carefully 
prepared and tested, the purpose of the interviewer or data taker 
is merely to see that the questions are understood and answered 
to the best of the ability of the informant, or that the right data 
are accurately copied from the proper sources. The less the
interviewer says or does beyond this, the more dependable the 
returns should be. He must be especially careful not to suggest 
answers to the informant, or to bias him in any way. While 
this may seem to be a negative role, it is one that calls for skill 
and judgment. The ability to induce informants of various 
kinds cheerfully to give accurate information, or to extract data 
without error from complex or confused records, is not common.

It is often desirable to test the results obtained by each 
interviewer by noting whether an interviewer's returns differ too 
much from those of others reporting similar data. Also, when 
the schedules are edited, certain kinds of errors made by the 
interviewers may be noted. The interviewers may then be
cautioned, or their work may be corrected for the personal
equation.

In the gathering of information by schedule, several interview- 
ers or clerks may be supervised by a foreman, or the investigator 
may do all this work himself. In any case, the investigator 
should participate in the actual field or library work at least
enough to acquire a firsthand knowledge of the conditions under
which the data were obtained, and a “feeling” for the data, as it is
termed. Many an investigation has been saved or lost by the presence
or absence of the analyst during the data-collecting process.

10. Editing the Schedules.—The schedules filled out during
each day on a large study are generally sent in to a group of
editors at the headquarters of the study. Under the direction
of a chief, these clerical workers look for unfilled spaces, for
inconsistent answers, and the like, on each schedule. Where
necessary, a defective schedule is returned to the field or library
foreman, who in turn hands it to the interviewer or the clerk
whose initials appear on it. In small studies, the schedules
taken during the day are often edited by the interviewers themselves
each night.

11. Tabulation of the Data.—Edited schedules go to tabulators,
who tally the data from the schedules to the tables, or,
if machine methods are used, punch them on cards according to a
code arranged for the purpose. Machines are, of course, faster 
and more economical for large-scale tabulation. 

The chief electrical machines now in use are the card punch 
and verifier, the sorting machine, and the tabulating machine. 
A general idea of what each does may be obtained from the 
following description: 


The first step in the use of a card for a particular record is the desig- 
nation of groups of columns as "fields." Each field defines a section of 
the card in which one particular type of information will always appear. 

The illustration following (Fig. 3) shows an 80-column card partly 
drawn up into fields. Each field is assigned a sufficient number of 
columns to include the largest number of digits which it will be called 
upon to accommodate. 


Fig. 3.—Eighty-column tabulating card.


For instance, the greatest number of months is 12 (a two-digit 
number), therefore, two columns are sufficient for recording this informa- 
tion. The greatest number of days in a month is thirty-one, thus this 
field too requires only two columns. The year is indicated by the last 
two digits, making two more columns necessary, etc. 

Figure 4 illustrates (a 45-column) card completely laid out for a 
specific job, in this instance a complex (criminological) study. 

At this point it is obvious that all pertinent information must be 
registered in the card in the form of punched holes. The perforation 
of these holes is a simple matter. The digits of the numbers to be 
transcribed correspond to the digits printed on the card. Thus, to 
show the date Oct. 15, 1934, on the card illustrated above (Fig. 3), 
the card is perforated as follows: 10-15-34. Descriptive information,
such as the names of persons or products, is generally coded numerically. 

Tabulating cards are perforated by means of an electric punching 
machine. The punch designed for the numerical system has a keyboard 
consisting of twelve keys, one for each punching position of a column. 
As a key is depressed a hole is cut and the card advanced automatically 
to the next column to be punched. The automatic features of the 
machine and the simplicity of the keyboard make the transcription of
written data into punched-hole form easy, rapid and efficient.

Fig. 4.—Forty-five column card with field headings.

When punching has been completed, the cards are usually in miscellaneous
order. The next step is to arrange them in sequence by some
desired classification—that is, to group them according to some information
which is punched in them. The Card-operated Sorting Machine
is used for this purpose.

The operation of the Electric Sorting Machine is based on the position
of the punched hole in a vertical column of the card. As the cards
pass through the machine a brush contact is made through the hole,
causing an electrical circuit to be closed. This momentary circuit causes
the card to be directed to a receiving pocket which corresponds to the
position of the punched hole. For example, a card punched “9” in
the column under consideration is directed to the 9 pocket, a card
punched “6” in the same column is directed to the 6 pocket, etc. . . .

The automatic sort is made on one column at a time. It is apparent,
therefore, that to arrange a group of cards in numerical sequence according
to the data punched in a three-column field, the group is passed
through the sorting machine three times. The sort is made first on the
units column, then on the tens column and finally on the hundreds
column. The Card-operated Sorting Machine is entirely automatic
and operates at a speed of 400 cards per minute.

The third step in the Punched Card method is the automatic com- 
pilation of the data into printed reports. This is accomplished by the 
Electric Tabulating Machine which is a combined adding, subtracting 
and printing machine. Punched cards passing through this machine 
actuate the various adding counters and printing mechanisms—again 
by means of electrical contacts . . . The machine is entirely automatic, 
operates at a speed of 150 cards per minute . . . ¹
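
The column-by-column sort described in the passage just quoted (units first, then tens, then hundreds) can be imitated in a short sketch. The cards, the field name `case_no`, and the function names are hypothetical and serve only to illustrate the procedure; each pass keeps cards in their existing order within a pocket, which is what makes three passes sufficient.

```python
def sort_on_column(cards, field, column):
    """One pass through the sorter: drop each card into the pocket (0-9)
    matching the digit punched in the chosen column, then restack the
    pockets in order. Cards keep their relative order within a pocket."""
    pockets = [[] for _ in range(10)]
    for card in cards:
        digit = int(card[field][column])
        pockets[digit].append(card)
    return [card for pocket in pockets for card in pocket]

def sort_on_field(cards, field, width):
    """Sort on a multi-column field, one pass per column:
    the units (rightmost) column first, the highest column last."""
    for column in range(width - 1, -1, -1):
        cards = sort_on_column(cards, field, column)
    return cards

# Hypothetical cards carrying a three-column field.
cards = [{"case_no": n} for n in ["305", "042", "917", "041", "300", "916"]]
ordered = sort_on_field(cards, "case_no", width=3)
print([card["case_no"] for card in ordered])
```

The number of passes equals the number of columns in the field, so a three-column field requires three runs of the whole deck regardless of how many cards it contains.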


12. Analysis of the Data.—The analysis of the data should
proceed along the lines laid down in planning the study, although 
any minor modifications or extensions that later appear advisable 
may be made. This means that the data have already been 
put in work tables for computing means, percentages, standard 
deviations, correlations, or whatever other statistics are needed 
for simplifying and interpreting the findings. After these 
statistics have been worked out and their accuracy has been 
carefully checked, the investigator should state the results as 
simply, briefly, and clearly as he can. Only a few of the most
vital tables should be presented with the text of the report, all 
others that seem desirable being placed in an appendix. Where 
graphic devices promise to be effective, they can be introduced. 

Perhaps the most important things to keep in mind at this 
crucial stage of an investigation are to limit the conclusions to 
what the data show, while yet seeking to use enough imagination 
and insight to discover all of the pertinent information that may 
be extracted from the findings. There are, of course, no rules 
by which this can be done. Everything depends upon the 
ability, integrity, training, and persistence of the analyst. 

13. The Amount of Error of Observation or Record in Statistical
Results.²—The readers of sociological studies are not
unreasonable when they express an attitude of skepticism 
toward the elaborate precision of some of the statistical tech- 
niques that are frequently applied to social data of doubtful 
character. 


¹ Herbert Arkin, in G. W. Baehne, ed., Practical Applications of the
Punched Card Method in Colleges and Universities, pp. 4-8, Columbia University
Press, New York, 1935.

² Adapted from a paper by T. C. McCormick, On the Amount of Error in
Sociological Data, American Sociological Review, Vol. 3, pp. 328-332, 1938.

The major difficulties involved in the estimation of errors of
observation are practical rather than mathematical and theo- 
retical in nature. Determination of the accuracy of findings is 
first a question of funds and time, and is tied up with administra- 
tive policies. 

In England, the importance of estimating errors of record in 
sociological results has been recognized by Arthur L. Bowley, 
who writes: 


If we do not know of the existence of biassed errors, which in reality
pervade our estimates, there is no remedy; if we know them, we are
likely to obtain more accuracy by the most erroneous corrections for
them than by neglecting them. . . . In the nature of things, when we
are dealing with errors we do not know their magnitude; the most we
can know is their probable and possible extent. We might estimate, for
instance, the percentage of unemployed in a certain year as 4.5, and
add, from information in our possession (coming from a study of wage
bills or the reports of relief agencies), that we considered this to be
within .5 of the fact; we should then write the number 4.5 ± .5, meaning
that the error in the estimate as defined above was unlikely to be
more than .5/4.5 = 1/9, or 11 per cent, the corresponding absolute error
being .5. In such a case we can also give definite limits. The percentage
employed must lie between 0 and 100; and if we could actually
enumerate 1 per cent of the working-class as out of work, and also 92
per cent as in work, we should know that the number required was
between 1.0 and 8.0 per cent, and the maximum error in our estimate, 4.5,
was 3.5/4.5 = 7/9, or 78 per cent. Even this is more precise than the
original statement, “the percentage is 4.5, error unknown.” By further
investigation we might perhaps bring the limits of error nearer to each
other, and decide that it was practically certain that the percentage
required was between 3.5 and 4.5; then we ought to say “the number
unemployed is .04 . . . of the working class, the estimate being correct
to the last figure given.” This statement is of the same nature as, “The
body weighs 15 lb. 3 oz., correct to an ounce.”¹
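
The arithmetic of the passage just quoted can be retraced in a few lines. The function and variable names are illustrative only; the figures are those of the quotation, and the relative error is taken as a fraction of the estimate.

```python
def relative_error(estimate, max_abs_error):
    """Largest likely error expressed as a fraction of the estimate."""
    return max_abs_error / estimate

estimate = 4.5  # estimated percentage unemployed

# Case 1: the estimate is believed to be within 0.5 of the fact.
case_1 = relative_error(estimate, 0.5)

# Case 2: only hard limits are known. One per cent were counted as out of
# work and 92 per cent as in work, so the true figure lies between
# 1.0 and 100 - 92 = 8.0.
lower, upper = 1.0, 100.0 - 92.0
max_abs_error = max(estimate - lower, upper - estimate)
case_2 = relative_error(estimate, max_abs_error)

print(f"case 1: {case_1:.0%}   case 2: {case_2:.0%}")
```

The two cases reproduce the fractions 1/9 and 7/9 of the quotation, that is, about 11 and 78 per cent.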


As yet, most of the theory underlying the subject of errors
consists of a number of precautions that simply need to be borne
in mind and observed. What seem to be the outstanding points
are briefly summarized below.

¹ A. L. Bowley, Elements of Statistics, 6th ed., pp. 180, 181, 192, Charles
Scribner's Sons, New York, for P. S. King & Son, Ltd., London.

(1) By definition, “The relative error in an estimate is the
    ratio of the difference between the estimate and the true
    value, to the estimate.”

(2) Where the necessary a priori information exists, the
    results of an investigation may be compared with expectation,
    and the extent of the error suggested in this way.
    The basis of the expectation must, of course, be justified.

(3) In the absence of adequate comparative data, the only
    possible method of finding errors of measurement or
    record is to repeat the measurements, or a sufficient proportion
    of them. These check measurements may be made
    with the same measuring instruments, or by other devices
    and approaches, to reveal possible errors due to a particular
    method or scale. A change of personnel to find the
    amount of error attributable to the “personal equation”
    is also important.

(4) Where differences between the original and the check
    measurements are found, investigation should continue
    until it is possible to correct the error sufficiently for the
    purpose in hand by averaging or other estimate.

(5) There are two well-known kinds of error of measurement
    or record, whose treatment is different:

    a. Unbiased or compensating errors. Some errors occur
       in opposite directions, and so wholly or partly cancel
       out in sums, averages, and other statistics. Such
       random errors, however, increase the value of the
       standard deviation and attenuate the correlation
       coefficient.¹

    b. Biased errors, or errors in the same direction:

       (a) Constant error. An error that remains the same
           from one measurement to the other, as when a foot
           rule is inaccurately divided, is usually hard to
           detect, but very common. In social investigation
           it may be due to wishful thinking, to loose definition,
           to falsification on the part of the subjects interviewed,
           and so on.

       (b) Accumulative error. Some biased errors increase
           from measurement to measurement, as when one
           is dealing with more and more difficult material.
           Thus, in taking the census, it is less easy to get
           accurate answers to certain questions from Negroes
           than from whites.

       (c) Irregular noncompensating error. When measurements
           vary erratically, so that they affect sums
           and averages in important but unpredictable ways,
           the error must be estimated or eliminated in each
           separate measurement.

¹ See Chaps. VIII and X for definition of these terms.
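
The different behavior of compensating and constant errors, as set out under point (5), can be seen in a small simulation. The sketch rests on assumptions that are not in the text: the "true" measurements are artificial, normally distributed figures, the compensating errors are drawn at random with a mean of zero, and the constant error is fixed at 5 units.

```python
import random
import statistics

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equally long lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1941)  # fixed seed so that the sketch is repeatable
n = 5000
true_x = [random.gauss(50, 10) for _ in range(n)]         # artificial true values
true_y = [0.8 * x + random.gauss(0, 6) for x in true_x]   # a related trait

# a. Compensating errors: as likely to be too high as too low.
noisy_x = [x + random.gauss(0, 10) for x in true_x]

# b(a). Constant error: every record is 5 units too high.
biased_x = [x + 5 for x in true_x]

for label, x in [("true", true_x), ("compensating", noisy_x), ("constant", biased_x)]:
    print(f"{label:>12}: mean = {statistics.fmean(x):6.2f}  "
          f"s.d. = {statistics.pstdev(x):5.2f}  r = {pearson_r(x, true_y):.3f}")
```

Under these assumptions the compensating errors leave the mean nearly unchanged but enlarge the standard deviation and lower the correlation, whereas the constant error shifts the mean by the full 5 units and leaves the other two statistics as they were.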


Apart from ingenuity and perseverance, there is no formula 
for finding such errors as these. Where they are suspected 
but not discoverable, it may be advisable to express results 
in the form of ratios, since biased errors are reduced in ratios and 
index numbers. As Bowley puts it, “The error in a ratio is
approximately the difference between the errors in its two
terms. . . .”
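
Bowley's rule for ratios can be checked numerically. In this sketch the relative errors are hypothetical and are taken, following definition (1) above, as fractions of the estimates; the function name is illustrative only.

```python
def relative_error_of_ratio(err_num, err_den):
    """Exact relative error of an estimated ratio whose numerator and
    denominator carry relative errors err_num and err_den, each measured
    as (estimate - true value) / estimate."""
    # true value = estimate * (1 - relative error), for each term
    return 1 - (1 - err_num) / (1 - err_den)

# Hypothetical case: numerator 6 per cent too high, denominator 4 per cent too high.
exact = relative_error_of_ratio(0.06, 0.04)
approximate = 0.06 - 0.04
print(f"exact = {exact:.4f}, difference of the two errors = {approximate:.4f}")

# Equal biased errors in both terms cancel completely in the ratio.
print(relative_error_of_ratio(0.05, 0.05))
```

When both terms are in error in the same direction and by similar amounts, as biased errors often are, the ratio is affected far less than either term taken alone.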

In social investigation it is especially important to avoid 
misleading accuracy of statement, such as carrying calculations 
based on crude data to two or three decimal places. The problem 
of how far not to carry significant figures should invariably be 
solved on the conservative side, as when, in rough population
estimates running into the millions, even the tens-of-thousands
places are filled with zeros, and the hundreds of thousands are
rounded off. 

The final statement of an average or other statistic should 
include the maximum amount by which it may reasonably be in 
error, expressed as a percentage of the value of the statistic, 
as already mentioned above. For example, suppose that the annual
church attendances per individual are found to be 58, and that the
error of record in this figure is estimated to be 10 per cent. These
facts may be expressed in some such form as 58 ± 10%.

As Bowley warns, it sometimes takes longer to estimate the 
approximate amount of error in the results of a study than it 
does to make the study itself. If sociologists give proper 
attention to the accuracy of their findings, therefore, they are 
certain to be forced by the interests of economy of time and
money to simplify their problems and to investigate the same
population as often as feasible. This is true if accuracy is
regarded as a purely relative thing, which need be no greater
than is required to obtain a satisfactory answer to a question in 
hand. 


Exercises 


1. What use may be made of nonquantitative methods in statistical 
research? 

2. Make a list of the main requirements of a well-chosen statistical 
problem, and give illustrations of what you consider good and poor, 
with your reasons. 

3. With which of the chief sources of secondary statistical data in
the United States are you acquainted? 

4. Select from the latest United States Census a few definitions that 
seem to you (a) satisfactory, (b) unsatisfactory, and explain why you
think so. 

5. What are some of the most unreliable counts in the United States 
Census of Population, and why? 

6. Collect instances of studies in which a questionnaire was mailed 
out and report on the proportion and representativeness of the returns 
received.

7. a. In the statistical laboratory, propose problems on a competitive
basis; and after a problem has been chosen, help design a study which 
your class in social statistics will carry out as a semester’s project. 

b. Does the problem satisfy the requirements that you listed under 
question 2 above? 

c. Indicate by which of the methods described in Chap. II the most
important traits or factors concerned in this study will be measured,
and show that no more exact measurement is feasible.

d. What is the dependent variable? 

e. What are the main independent variables? 

f. What are the important interfering factors? 

g. How will the interfering factors be controlled? 

h. Is the sample adequate in size? 

i. What is your assurance that it is representative?

j. Does the schedule meet the demands mentioned in this chapter?
Review the points. 

k. Do you have all the tables that will be needed for computation 
and exhibition purposes, and for interpreting the data? 

l. Do your instructions leave any important terms undefined, or 
any procedures unexplained? 

m. By what methods do you propose to test the reliability and, if
necessary, the validity, of your schedule?

n. To what extent have you used the method of cooperative definition
to improve the validity of your indexes?

o. Will you try to measure the error due to the personal equation of
the interviewers?

p. How will you estimate the amount of error in your final results?


PART II
Statistical Methods


CHAPTER V

TABULATION OF FREQUENCY DISTRIBUTIONS


1. Before large groups of figures of any kind can be
studied and interpreted, they must be arranged, or
classified, in some orderly and meaningful way.
To illustrate the first steps in the tabulation of statistical data, let
us consider the sizes of sibling families from which the students
of a college come. A definition is needed. What is meant
by “sizes of sibling families”? Let us say that we mean the
number of brothers and sisters, including the student. The
sibling family, then, is the thing to be measured, while a sibling
is the unit of count or measurement. Are siblings deceased to be
counted? What of siblings married and moved away? What
of adopted or other children not brothers or sisters
reared in the family? Always such questions of definition
of the thing to be measured and of the unit of measurement
arise in the beginning of a careful inquiry, statistical or otherwise,
and must be settled with the purpose of the investigator in view.
In the present case, let us say that deceased siblings, siblings
away from home, and children adopted or reared as siblings
in the family shall be included.

Assuming that we have defined the thing to be counted or
measured, the sibling family, and the unit of count or measurement, a
sibling, and that the units are equal and equivalent for our purpose,
we then ask each student to tell the size of his sibling family.
Let us imagine that 200 students report the sizes of their sibling
families.

2. The Frequency Distribution: Discrete Variable.—We have
here 200 values, varying from 1 to 15. So far, the answer to our 
wish to know the sizes of sibling families to which the students 
belong is rather confusing. The range, or spread between the
smallest and the largest values is the clearest bit of information
we have. It extends from 1 to 15, and is therefore 14. We
should also like to know how many families of each size there are.
As a preliminary step to this end, it is convenient to put the
items in the form of an array, which means merely putting them
in order of size.


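TABLE 5.—ARRAY OF VALUES

1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6  8
1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6  8
1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 5 6 6  9
1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 5 6 6  9
1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 6 7 10
1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 6 7 11
1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 7 12
1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 7 12
1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8 14
1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8 15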


As a rule, however, a better form of the array is the frequency
array. It is obtained from the original data by setting up a
consecutive series of numbers covering all the observed values
(here, the sizes of sibling families, 1, 2, 3, etc.) in the left-hand
column of Table 6, and tallying¹ in the right-hand column the
number of times each consecutive value (size of family) occurs.
The latter figures are termed frequencies.
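
The tallying just described can be imitated mechanically. In the sketch below the 200 values are transcribed row by row from Table 5; the variable names are illustrative, and `collections.Counter` from the Python standard library does the counting.

```python
from collections import Counter

# The 200 reported sizes of sibling family, as arrayed in Table 5.
rows = [
    "1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8",
    "1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8",
    "1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 5 6 6 9",
    "1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 5 6 6 9",
    "1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 6 7 10",
    "1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 5 6 7 11",
    "1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 7 12",
    "1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 7 12",
    "1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8 14",
    "1 1 1 2 2 2 2 2 2 3 3 3 3 4 4 5 5 6 8 15",
]
values = [int(v) for row in rows for v in row.split()]

tally = Counter(values)

# A consecutive series of sizes, including any size that never occurs.
frequency_array = [(size, tally[size]) for size in range(min(values), max(values) + 1)]

for size, frequency in frequency_array:
    print(f"{size:>3} {frequency:>4}")
print("Total", sum(frequency for _, frequency in frequency_array))
```

Listing every consecutive size, rather than only those observed, is what brings the zero frequency for families of 13 into the table. The total serves as the check against the number of original items.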


1Tallying is commonly done by making in the proper row a sloping 
stroke for each item (e.g., family) until four strokes are made, then drawing 
a stroke through them for the fifth item: OK S ///. The tallies 
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TABLE 6,—Frequency ARRAY OF VALUES 


Size of Students Reporting 
Sibling Family Frequencies 
1 39 
2 55 
3 38 
4 24 
5 16 
6 12 
7 4 
8 4 
9 2 
10 1 
11 1 
12 2 
13 0 
14 1 
15 БЫЛ, 
Total. Soson ое АЛАЛЫ 200 


This gives the same information as Table 5, but їп a much 
More compact form. Table 6 also satisfies our curiosity relative 
to the number of students reporting each size of sibling family. 
We see at once that most students are members of families of 
three or fewer siblings. à 
. We shall next try lumping together into classes, or class 
tervals, more than one size of family, with the double purpose 
of showing more smoothly how the students are grouped with 
Tespect to size of family, and of more easily calculating averages! 
and other statistics from the table. Imagine combining into . 
` Classes family sizes 1 and 2, 3 and 4, 5 and 6, 7 and 8, and so on. 
We then get Table 7. 

The work of combining the frequencies should be carefully
checked by repetition, and it should be noted that the total is
the same as for Table 6.


are next counted, and a figure representing the total number of items 
in the row is entered in the frequency column of the table. The work of
tallying should be repeated, as a check, and the total should agree with the
number of original items (sibling families). When machine methods of
tallying are used, the sorting machine counts the frequency in each class,
and the resulting totals are simply read off and entered in the table.

¹ But the average found from Table 7 will be less accurate than that found
from Table 6.
TABLE 7.—FREQUENCY DISTRIBUTION OF DATA

Size of            Students Reporting
Sibling Family     (Frequencies)
1 and 2             94
3 and 4             62
5 and 6             28
7 and 8              8
9 and 10             3
11 and 12            3
13 and 14            1
15 and 16            1
Total              200


Table 7 is still more concise than Table 6 and the distribution 
of the frequencies is more regular. There are no classes of zero 
frequencies, but instead a rather steady decline in the number of 
cases as the size of family increases, which is what one would 
expect. 
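
The step from Table 6 to Table 7 can be sketched as follows. The helper `group_into_classes` and its parameters are hypothetical names chosen for this illustration; the frequencies are those of Table 6.

```python
# Frequencies of Table 6: size of sibling family -> students reporting.
table_6 = {1: 39, 2: 55, 3: 38, 4: 24, 5: 16, 6: 12, 7: 4, 8: 4,
           9: 2, 10: 1, 11: 1, 12: 2, 13: 0, 14: 1, 15: 1}


def group_into_classes(frequencies, width, start):
    """Combine a frequency array into classes of `width` consecutive values."""
    grouped = {}
    for value, frequency in frequencies.items():
        lower = start + ((value - start) // width) * width
        limits = (lower, lower + width - 1)
        grouped[limits] = grouped.get(limits, 0) + frequency
    return grouped


table_7 = group_into_classes(table_6, width=2, start=1)
for (lower, upper), frequency in table_7.items():
    print(f"{lower} and {upper}: {frequency}")

# The total must be the same as for Table 6.
assert sum(table_7.values()) == sum(table_6.values()) == 200
```

The zero frequency at size 13 is absorbed into the class 13 and 14, which is why no class of Table 7 is empty.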

Tables 6 and 7 are called simple frequency distributions, or 
merely frequency distributions, because they show the frequency of 
occurrence of a set of values arranged in order of size. Table 6 
was also called a frequency array because successive class values 
increased by single units. 

In Table 7 the question arises, What is now the size of family
in each class? In the first class, is the size of family the average
of 1 and 2 = 1.5? This is more reasonable than to say that the
size is either 1 or 2. But how can a family consist of one person
and a half person? Is not this taking liberties with the data?
The trouble is due to the circumstance that we are dealing with a
discrete series, i.e., a series that can take only certain values (whole
numbers) and no intermediate values. Thus a sibling family
may contain 1, 2, or 3 members, but not 1.3, 2.7, or 3.6 members,
because people always come in wholes! In contrast to a discrete
series is a continuous series, in which the variable¹ may assume
any whole or decimal value whatever. The ages in years of the
students in a sample represent a continuous series: 19.3, 20.4,
20.6, 21.2, 21.7, 21.9, 22.1, 22.5. While a continuous series can
always be mathematically averaged without logical offense,
this is not true of a discrete series. For example, the ages in
years of five students are 19.3, 20.4, 21.7, 21.9, 22.1, and their
average (arithmetic mean²) age is 21.08, which is a possible value.
But if five sibling families are of sizes 1, 2, 2, 3, and 5, respectively,
the mean is 2.6, which is a fictitious value. We are thus
faced with the dilemma either of disregarding the logical nature
of a discrete series, or of abandoning the attempt to analyze it
in terms of averages and other mathematical concepts. Since
the purpose of an average is to simplify and represent a series, a
fractional value may serve this end in the case of a discrete series,
even though it is not strictly realistic, and many valuable facts
can be discovered in this way that otherwise would not appear.
For these reasons, discrete series are usually thrown into frequency
distributions and treated in some ways as if they were
continuous.

¹ A quality (e.g., sibling family) that varies in size or amount.
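
The two means quoted in this paragraph can be checked directly. The sketch below uses only the figures given in the text; the variable names are illustrative.

```python
from statistics import mean

ages = [19.3, 20.4, 21.7, 21.9, 22.1]   # continuous series (years)
family_sizes = [1, 2, 2, 3, 5]          # discrete series (whole persons)

mean_age = round(mean(ages), 2)
mean_size = mean(family_sizes)

print(mean_age)    # 21.08
print(mean_size)   # 2.6

# The mean of the discrete series is not itself a possible family size.
is_possible_size = float(mean_size).is_integer()
print(is_possible_size)   # False
```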

Returning now to Table 7, we may regard the average value
of the two sizes of families grouped together in each class as the
mid-point of the class (e.g., for the first class of Table 7, the
mid-point is $(1 + 2)/2 = 1.5$). When any item is placed in a class
with other items, it is understood that it thereupon exchanges its
original value for that of the mid-point of the class. For example,
when a family of 4 siblings is placed in the class 3 and 4 in
Table 7, the 4 is thereafter treated as if it were 3.5. The mid-points
of any class should, therefore, always be as close as
possible to the true average of the items included in the class.
From Table 6 we see that the true weighted³ mean size of the
families of 1 and 2 siblings is

$$\frac{(39 \times 1) + (55 \times 2)}{39 + 55} = \frac{149}{94} = 1.585,$$

whereas our mid-point is 1.5. This is rather close agreement,
and may be satisfactory for our purposes. The mid-points of
other classes may be similarly tested. From Table 6 it can be
seen that the mid-point of the first class in Table 7 is too small,
because there are more families of 2 than of 1; but this is somewhat
offset by too large a mid-point in the next class; and so on.
Where one error balances another in this way, the accuracy of the
mean found from the table is improved, although the mid-points
of some of the classes may not be too good. Recasting Table 7
in mid-point form, we have


TABLE 8.—FREQUENCY DISTRIBUTION OF DATA: MID-POINT FORM


Size of sibling family, | Students       | Product
mid-point (X)           | reporting (f)  | (Xf)
 1.5                       94              141.0
 3.5                       62              217.0
 5.5                       28              154.0
 7.5                        8               60.0
 9.5                        3               28.5
11.5                        3               34.5
13.5                        1               13.5
15.5                        1               15.5
Total                     200              664.0

² (19.3 + 20.4 + 21.7 + 21.9 + 22.1)/5 = 21.08. See Chap. VII.
³ In the weighted mean, each value (e.g., size of family) is counted as
often as it occurs.


If we calculate the arithmetic mean from Table 8, we may 
compare it with the true mean found from Table 6. To find 
the mean, we multiply each mid-point by its frequency, sum 
the products, and divide by 200. This gives for the data of 
Table 8 a mean of 3.32, and for Table 6 a true mean of 3.315,
which in this case are nearly identical. We may, therefore, 
approve Table 8 as far as this test is concerned. 
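
The comparison made above can be reproduced as a short sketch. The function `weighted_mean` is an illustrative helper, not something defined in the text; the frequencies are those of Table 6, and the mid-points are built by pairing sizes 1 and 2, 3 and 4, and so on, as in Table 8.

```python
# Frequencies of Table 6: size of sibling family -> students reporting.
table_6 = {1: 39, 2: 55, 3: 38, 4: 24, 5: 16, 6: 12, 7: 4, 8: 4,
           9: 2, 10: 1, 11: 1, 12: 2, 13: 0, 14: 1, 15: 1}


def weighted_mean(frequencies):
    """Mean in which each value is counted as often as it occurs."""
    total = sum(frequencies.values())
    return sum(value * f for value, f in frequencies.items()) / total


# True mean of the families in the first class (sizes 1 and 2), to be set
# against the mid-point 1.5 that replaces them in Table 8.
first_class = {size: table_6[size] for size in (1, 2)}
print(round(weighted_mean(first_class), 3))   # 1.585

# Table 8: class mid-point -> frequency.
table_8 = {}
for size, f in table_6.items():
    mid_point = size + 0.5 if size % 2 == 1 else size - 0.5
    table_8[mid_point] = table_8.get(mid_point, 0) + f

true_mean = weighted_mean(table_6)      # from the frequency array
grouped_mean = weighted_mean(table_8)   # from the mid-point form
print(round(true_mean, 3), round(grouped_mean, 3))
```

The same `weighted_mean` call on any other pair of sizes tests the mid-point of that class in the same way.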

In the case of the data on sibling families, we need to show only 
the lowest and highest whole numbers that can fall within a 
class, because we are dealing with discrete or whole numbers. 
These upper and lower limits of a class are called class limits. 
We may set up the stub, or first column, of the frequency dis- 
tribution as shown in Table 7, or, if we prefer, we may write 1-2, 
3-4, 5-6, and so on. Frequency distributions are usually given 
in class limit rather than in mid-point form, but the latter is also 
common. The former is better suited for tallying, the latter for 
computing purposes. 

3. Selection of a Class Interval.—The suggestions usually
given to aid in choosing a class interval for untabulated data are:

1. Note the range of the data, i.e., the difference between the
largest and smallest values of the variable.

2. Decide about how large the interval has to be to make a
significant difference in the data. For example, a difference of
less than five points in a distribution of students' grades would
seem to be of no consequence, since most teachers make no
attempt to grade closer than that. Indeed, 10 points may
seem to some sufficiently close.
If the values already have some natural spacing, the latter
should often be taken as the interval. For example, the size of
farms in certain regions tends to be a multiple of 40 acres: 40,
80, 120, 160, etc.

3. Consider how many class intervals would result if the size
of interval tentatively chosen in (2) above were divided into
the range found in (1). As a rule, from 10 to 20 intervals are
desirable, although, of course, more or fewer are permissible.
Revise the size of interval suggested in (2) somewhat, if it seems
advisable.

4. Make all intervals of equal size, if feasible, and avoid
open-end intervals when possible.

5. Decide tentatively upon the mid-points and class limits of
the intervals. Unless difficulty in classification is introduced,
the mid-points should be whole numbers for convenience in
computing, and if they can be multiples of 5 or 10, so much
the better.

6. Tally the data in the class intervals chosen. Note whether
the resulting distribution reveals a smooth trend in the frequencies
from one end of the scale to the other, avoiding an
irregular, broken effect. If too large an interval has been used,
some points of interest relative to increase or decrease of frequencies
will be concealed. If the interval is too small, the
distribution will lack smoothness. It is often necessary to try
tabulations by larger and smaller intervals to decide these
points.

7. The accuracy of the class interval chosen for computation
purposes should be tested by calculating the arithmetic mean
from the table and comparing it with the true mean found from
the ungrouped data or from a large random sample of the
ungrouped data. To obtain a class interval that will give
maximum accuracy it is often helpful to use a sliding scale
device like that illustrated below (Fig. 6, applied to Fig. 5).

The application of these suggestions will be illustrated.
Below is a list of the final grades of a class in statistics:


80 81 87 83 94 
94 85 78 82 85 
85 81 87 87 65 
88 81 80 75 70 
73 63 77 80 88 
78 73 68 79
68 83 70 83 
72 84 74 88 
76 90 85 88 


Ordinarily it is not worth while to set up a frequency distribu- 
tion for 41 cases, but a small number is used here for convenience. 

The first step is to arrange these values in order of size, to 
form a frequency array. 


TABLE 9.—FREQUENCY ARRAY OF VALUES


Grade   Frequency      Grade   Frequency
(X)     (f)            (X)     (f)
63      1              79      1
65      1              80      3
68      2              81      3
70      2              82      1
72      1              83      3
73      2              84      1
74      1              85      4
75      1              87      3
76      1              88      4
77      1              90      1
78      2              94      2
                       Total   41
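
The frequency array, and the range and interval counts taken up in the next paragraph, can be sketched for the 41 grades listed above. The helper `intervals_needed` is a hypothetical name, and counting `data_range + 1` distinct whole-number values is an assumption made here for discrete grades.

```python
import math
from collections import Counter

# The 41 final grades, row by row as listed in the text.
grades = [80, 81, 87, 83, 94,
          94, 85, 78, 82, 85,
          85, 81, 87, 87, 65,
          88, 81, 80, 75, 70,
          73, 63, 77, 80, 88,
          78, 73, 68, 79,
          68, 83, 70, 83,
          72, 84, 74, 88,
          76, 90, 85, 88]

# Frequency array: each observed grade, in order of size, with its frequency.
frequency_array = sorted(Counter(grades).items())

# Suggestion 1: the range of the data.
data_range = max(grades) - min(grades)


# Suggestion 3: how many intervals a trial width would produce. Because the
# grades are whole numbers, the scale covers data_range + 1 distinct values.
def intervals_needed(width):
    return math.ceil((data_range + 1) / width)


print(len(grades), data_range)                     # 41 31
print(intervals_needed(5), intervals_needed(10))   # 7 4
```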


The range is 94 — 63 = 31. Intervals of less than five do not 
seem justified by the accuracy of the data. A natural grouping, 
or tendency for the grades to cluster about multiples of five, 
would be expected. There would be only three or four intervals 
of 10, which seem too few. A trial interval of five, with mid- 
points at 65, 70, 75, etc., is shown in Table 10. These mid- 
points are especially appropriate, because the clustering of 
the grades around them should increase the accuracy of the 
table for computing averages and other statistics, as illustrated 
above. Narrow class intervals are also generally more accurate 
for computation purposes than are wide ones. Like the data 
on sibling families, these percentage grades are given only in 
whole numbers, and so may most conveniently be regarded as 
discrete. If 65 is taken as the mid-point of an interval of 5, 
evidently the lowest grade that belongs in this interval is 63 
and the highest is 67, so that the five grades 63, 64, 65, 66, and 
67 are included. If the width of the class interval were an
even number, such as 4, instead of an odd number, such as 5,
the mid-points would be forced to take a decimal value, as was
the case in Table 7.


FIG. 5.—Data of array on page 66 plotted on unit scale.


FIG. 6.—Sliding scale, with trial interval of 4.


TABLE 10.—FREQUENCY DISTRIBUTION OF STUDENTS' GRADES


Grades                Frequency
(X)      L*           (f)          Xf

65       63-67         2           130
70       68-72         5           350
75       73-77         6           450
80       78-82        10           800
85       83-87        11           935
90       88-92         5           450
95       93-97         2           190
Total                 41         3,305


* L means class limits.


From inspection of Fig. 5, where each value has been plotted
along the grade scale, it appears that the above choice of class
intervals throws the mean below the mid-point in the intervals
63-67, 68-72, 73-77, 83-87, 88-92, 93-97. Only in two intervals,
however, 88-92 and 93-97, is the lack of balance serious. In
one interval, 78-82, the mean is at the mid-point. The true
mean of the series, computed from the separate values, is 80.195.
The mean found from the frequency distribution of Table 10 is
80.610, which is 0.415 too high, as would be expected. If this
amount of inaccuracy is considered important for the problem in
hand, an attempt to obtain a better class interval should be
made. This may be facilitated by making a sliding scale from
ordinary coordinate paper, using the same units as in Fig. 5
(see Fig. 6). Class intervals of different sizes may be measured
off on the sliding scale, and each tested in turn against the scale
in Fig. 5. In the case of any given interval, the trial scale (Fig. 6) 
is moved along the fixed scale (Fig. 5) until the frequencies 
shown on the fixed scale are as evenly balanced as possible 
around the mid-points of the trial class interval on the sliding 
scale. If satisfactory, the values of the class limits may then 
be read off on the fixed scale from the intervals on the sliding 
scale when in this position of balance. Usually, some inaccuracy 
is inevitable in the use of class intervals. The problem is to 
keep it within such limits that no serious damage will result to 
the conclusions of the study. 
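
The test of Table 10 and the idea of the sliding scale can be imitated numerically. The sketch below is an assumption-laden stand-in for the paper device: instead of judging balance by eye, it slides the lowest class limit one point at a time and compares each grouped mean with the true mean. The function `grouped_mean` and the range of trial positions are choices made here, not the author's procedure.

```python
# The 41 final grades, row by row as listed in the text.
grades = [80, 81, 87, 83, 94,
          94, 85, 78, 82, 85,
          85, 81, 87, 87, 65,
          88, 81, 80, 75, 70,
          73, 63, 77, 80, 88,
          78, 73, 68, 79,
          68, 83, 70, 83,
          72, 84, 74, 88,
          76, 90, 85, 88]


def grouped_mean(values, width, lowest_limit):
    """Mean when every value is replaced by the mid-point of its class."""
    total = 0.0
    for value in values:
        lower = lowest_limit + ((value - lowest_limit) // width) * width
        total += lower + (width - 1) / 2   # mid-point of a discrete class
    return total / len(values)


true_mean = sum(grades) / len(grades)

# Table 10: width 5, classes 63-67, 68-72, ..., mid-points 65, 70, ...
table_10_mean = grouped_mean(grades, width=5, lowest_limit=63)
print(round(true_mean, 3), round(table_10_mean, 3))   # 80.195 80.61

# "Sliding" the scale: try each position of the lowest class limit and
# keep the one whose grouped mean lies closest to the true mean.
trials = {start: grouped_mean(grades, 5, start) for start in range(59, 64)}
best_start = min(trials, key=lambda s: abs(trials[s] - true_mean))
print(best_start, round(trials[best_start], 3))
```

A position chosen this way may give mid-points that are no longer multiples of 5, so the gain in accuracy has to be weighed against the convenience recommended in suggestion 5.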


TABLE 11.—BIRTH RATES PER 1,000 IN 150 APPROXIMATELY EQUAL
POPULATIONS


Class limits   | Mid-points | Frequencies
(1)              (2)          (3)
12.5-13.4        13             3
13.5-14.4        14            15
14.5-15.4        15            26
15.5-16.4        16            31
16.5-17.4        17            43
17.5-18.4        18            25
18.5-19.4        19             5
19.5-20.4        20             2
Total                         150


4. The Frequency Distribution: Continuous Variable.—A con- 
tinuous variable, such as birth rates, is tabulated in the same 
way as a discrete variable, except that a slight modification is 
needed in finding class limits from mid-points, and vice versa. 
In Table 11, given the mid-points of col. (2), what are the lower 
and upper values of each class within which the birth rates can 
be classified? The boundary line between any two mid-points 
should evidently be halfway between them, in this case ½ of 1,
or 0.5 unit above the lower or below the higher mid-point. We
thus get as our class limits in col. (1), 12.5, 13.5, 14.5, and so on.
Notice that the upper limit of any class is made slightly smaller
than the lower limit of the class just above, to indicate that a
case falling exactly on the border line between two classes is
placed in the upper class rather than in the lower.¹ Assuming
that our original data carry only one decimal place, it is enough
to write the upper limit of the class 12.5 to 13.5, for example, as
13.4; but if the data carried two decimal places, the upper limit
should be written 13.49; and so on.

To find the mid-points, given the class limits, of the continuous
variable of Table 11, we add the lower limit of an interval to
the lower limit of the interval next higher on the scale, then average
them:

$$\frac{12.5 + 13.5}{2} = 13, \qquad \frac{13.5 + 14.5}{2} = 14,$$

and so on.
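
The two rules above, one for mid-points and one for the written upper limit, can be sketched as follows. The name `written_upper_limit` and the extra closing limit 20.5 are assumptions introduced for the illustration.

```python
# Lower class limits of Table 11, plus the lower limit of the class that
# would follow the last one (needed to close the final interval).
lower_limits = [12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5, 20.5]

# Mid-point: average of a lower limit and the next higher lower limit.
mid_points = [(a + b) / 2 for a, b in zip(lower_limits, lower_limits[1:])]


def written_upper_limit(next_lower, decimals):
    """Upper limit as written: one unit of the last decimal place below
    the lower limit of the next class."""
    return round(next_lower - 10 ** -decimals, decimals)


print(mid_points[:2])                 # [13.0, 14.0]
print(written_upper_limit(13.5, 1))   # 13.4
print(written_upper_limit(13.5, 2))   # 13.49
```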


5. The Frequency Distribution: Nonquantitative Variable.²—
Let us imagine that, instead of adopting quantitative classes for
sizes of sibling families, as shown in Table 7, we had asked
the students to state whether the size of their sibling
family was large, medium, or small, without telling them what
sizes of families should be placed in each of the three categories.
We might then get a table something like Table 12.


TABLE 12.—SIZE OF SIBLING FAMILIES


Size of            Students
Sibling Family     Reporting
Small 18 
Medium... м d 
E 500 

Total 


We may now call attention to three requirements of classification
that were not mentioned in our previous work, although
they were tacitly assumed. The first of these is that the categories
must be mutually exclusive. The second is that they must
be exhaustive. The third is that there must be only one basis of
classification at a time.


¹ Theoretically, the frequency of a value that is identical with a class limit
should perhaps be divided equally between the two classes above and below;
but in most practical work the method suggested above is more convenient
and is used.

² A nonquantitative variable is a quality that varies in amount, but is not
measured in terms of units.
With respect to the last-named requirement, the basis of
classification in Table 12 is size of sibling family. There is no 
evidence that any other principle was used in this table. A 
question may be raised, however, about the first requirement. 
If we checked to see, we should certainly find that some sibling 
families of three were listed as small and some as medium, and 
that similar errors were made in the case of families of other 
sizes. Moreover, if the sibling families reported above were 
entered in Table 12 by two independent investigators, even 
though neither happened to put families of the same size in two 
different classes, there is little chance that they would both 
classify each size of family in the same class. Some would regard 
a family of three as medium, others would regard it as small. 
Their finished tables would not show the same frequencies in each 
class. Because these difficulties of classification multiply with 
the number of classes, it is usually advisable to have very few 
classes in a qualitative table, e.g., three in Table 12. This 
limits the analysis to broad categories. 

Regarding the principle of exhaustiveness of classification, we 
need to ask: Were there any families that could not be classified in 
one of the three classes of Table 12? Apparently there were not, 
so the table passes this test. 

Can we calculate the mean of Table 12, as we did in Table 8? 
At once the question arises, what are the values of the mid-points 
in Table 12? Since the classes in this table are not quantitative, 
no quantitative values can be assigned to their mid-points. We 
therefore discover that we are unable to analyze a nonquantita- 
tive table by the use of the mean. All that we can do is to say 
that the modal class, or the class containing the largest frequency, 
is that of small families. 

From this illustration, we learn that nonquantitative tables 
not only are likely to violate the logical principle of classification, 
which requires that the several categories be mutually exclusive, 
but that they also do not lend themselves to the calculation of 
the mean and other basic statistical measures by which quantitative
tables are customarily analyzed. For such reasons as
these, quantitative classes are always to be preferred to qualita- 
tive for purposes of statistical analysis. The latter should be 
employed only where quantitative classes are not obtainable. 
6. The Frequency Distribution: Table Structure.¹—The main
heading of a frequency table is called the title; the left-hand
column with its heading, the stub; and the heading of the right-hand
column, the caption. These are illustrated in Table 13.


(Title) TABLE 13.—THE SIZE OF SIBLING FAMILIES OF 200 STUDENTS OF
SOCIOLOGY, BLANK COLLEGE, 1939-1940


(Stub) Siblings in Family    (Caption) Students Reporting
1-2                           94
3-4                           62
5-6                           28
7-8                            8
9-10                           3
11-12                          3
13-14                          1
15-16                          1
Total                        200


As far as feasible, a table should be self-explanatory, but
no unnecessary word or figure should be included. The title
should usually mention the variable in the stub first; the units
in the caption and their number, second; and any further subdivisions
of the stub or caption. It should also generally mention
the date and place. The purpose of the stub and the caption


TABLE 14.—THE SIZE OF SIBLING FAMILIES OF 200 STUDENTS, BLANK
COLLEGE, 1940-1941, BY URBAN AND RURAL RESIDENCE


                        Students reporting
Siblings in family
                        Total    Urban    Rural
1-2                      94       28       66
3-4                      62       20       42
5-6                      28       11       17
7-8                       8        5        3
9-10                      3        2        1
11-12                     3        0        3
13-14                     1        0        1
То on : : T 
¹ More detailed discussion of this topic and of tables that do not represent
frequency distributions will be found in the fifth and seventh references at
the end of this chapter.
is simply to indicate the nature of the entries in the columns.
The “Total” row is often placed at the top instead of at the 
bottom of the table. One customary type of ruling is shown in 
Table 14. If it is desirable to block off one part of a table from 
another, this may be done by means of a heavy or double ruling. 

The chief requirement of a good table is that it be simple and 
clear. For this reason, it is generally unwise to subdivide the 
stub or the caption very often. One simple subdivision of the 
caption is shown in Table 14. 

In the case of every subclassification of the data in a table, 
the principles of classification already mentioned apply. 


Exercises 


1. Tabulate each of the two following series in a frequency dis- 
tribution, showing class limits, and test the accuracy of each of the 
tables. Note: The population numbers are so large that only whole 
hundreds or thousands should be used as class limits and mid-points; 
but in finding mid-points from the class limits the method suggested for 
a continuous variable should be used. An interval at least as small as 
5,000 seems to be needed to differentiate between the bulk of the county 
populations under 40,000. But above that point increasingly large 
intervals are appropriate. The last interval may be taken as “300,000 
and over,” with the actual population of the single largest county, 


318,587, given in a footnote. A table may be “broken” to avoid many 
intervals without frequencies. 
GEORGIA COUNTIES, 1930*


County | Population | Population per | County | Population | Population per
       |            | square mile    |        |            | square mile

1 13,314 29.3 46 21,599 50.1 

2 6,894 20.9 47 18,025 45.4 

3 7,055 26.0 48 22,306 65.2 

4 7,818 21.9 49 9,461 45.5 

5 22,878 74.5 50 18,273 34.9 

6 8,703 43.7 51 2,744 “276 

7 12,401 73.8 52 10,164 22.7 

8 25,364 53.9 53 18,485 51.2 

9 13,047 51.0 54 24,101 31.5 
10 14,646 32.2 55 7,102 24.7 
11 77,042 278.1 56 12,909 2.3
12 9,133 44.0 57 8,665 37.0
13 6,895 15.9 58 48,667 96.9
14 21,330 41.5 59 10,624 43.0
15 5,952 13.8 60 15,902 57.0
16 26,509 39.7 61 318,587 1,650.7
17 29,224 30.6 62 7,344 16.7
18 9,345 46.0 63 4,388 25.8
19 10,576 37.2 64 19,400 44.2
20 6,338 8.9 65 16,846 44.9
21 9,903 46.9 66 19,200 43.2
22 8,991 39.4 67 12,616 30.3
23 34,272 69.7 68 27,853 63.3 
24 9,421 55.7 69 12,748 44.0 
25 4,381 5.5 70 30,313 69.4 
26 105 71 18,070 24.7 
27 ner ae 72 13,203 46.7 
28 15,407 47.0 73 11,140 22.2 
29 20,003 46.0 74 15,174 а 
29 25,613 224.7 75 9,102 31.9 
31 5,924 49.1 
AA 6,943 34.2 76 x Sa 25-5 

10,260 72.3 7T р 
33 7/015 oia 78 12,199 32.8 
94 35,408 100.3 79 21,609 60.9 
s 19,739 31.2 80 8,594 25:8 
36 81 8,118 27.1 
a n 251 82 20,727 32.1 
АВ 11,311 46.9 83 12,908 АТ, 
S 25,127 56.7 En 12,081 43.4 
40 à } 8,992 23.9 
7,020 22.0 85 i 
41 9,754 53.0 
42 17,343 E 5,190 27.2 
43 4,146 32,693 40.6 
44 3,502 88 8,328 25.5 
45 23,022 E 8:153 15.0


* From the Fifteenth Census of the United States, 1930, Bureau of the Census,
Washington, D.C.
GEORGIA COUNTIES, 1930.*—(Continued)


County | Population | Population per | County | Population | Population per
       |            | square mile    |        |            | square mile

91 7,847 27.0 126 20,503 25.8 
92 4,180 10.6 127 7,389 30.8
93 29,994 62.1 128 23,495 112.4 
94 4,927 17.6 129 11,740 70.7 
95 9,014 31.4 130 11,114 27.0 
96 5,763 12.3 131 26,800 58.8 
97 16,643 50.1 132 8,458 27.1 
98 14,921 52.5 133 6,172 29.1 
99 6,968 19.4 134 15,411 33.1 
100 22,437 45.2 135 10,617 31.2 
101 9,076 35.9 136 14,997 40.2 
102 6,730 49.1 137 18,290 51.8 
103 23,620 43.1 138 32,612 61.5 
104 11,606 24.7 139 16,068 66.1 
105 10,020 52.7 140 17,165 43.7 
106 12,488 32.0 141 4,346 24.0 
107 9,215 26.9 142 7,488 28.6 
108 57,558 244.9 143 36,752 84.5 
109 17,290 60.0 144 11,196 48.5 
110 8,082 47.0 145 8,372 26.7
111 12,927 25.6 146 6,340 19.6
112 12,327 38.0 147 19,509 61.5
113 10,268 57.4 148 26,206 60.7
114 9,087 41.9 149 21,118 63.8
115 12,522 36.3 150 26,558 34.4
116 10,853 45.8 151 11,181 27.7
117 25,141 79.3 152 25,030 37.4
118 9,005 34.9 153 12,647 20.6 
119 8,367 23.2 154 5,032 16.7 
120 3,820 26.5 155 9,149 34.7 
121 6,331 16.8 156 6,056 24.7 
122 17,174 41.7 157 20,808 73.5 
123 72,990 228.8 158 13,439 33.3 
124 7,247 60.9 159 15,944 34.8 
125 5,347 34.7 160 10,844 23.0 

21,094 32.4


* From the Fifteenth Census of the United States, 1930, Bureau of the Census,
Washington, D.C.

2. Subdivide the table of county populations prepared in Exercise 1 
above according to population per square mile, choosing your own points 
of division in the latter factor. 
3. Open a textbook in elementary sociology to some page at random,
and classify each word on the page as “Very short,” “Short,” “Average,”
“Long,” “Very long.” Show the results in tabular form. Do
the same thing for an elementary textbook in economics, and compare
the length of words in the two tables.

4. It is wanted to know the occupation of the fathers of students 
majoring in sociology. The students are asked to check the form 
below: 

Laborer 
Businessman 
Professional 
Farmer 


Is this satisfactory? 

Where would a carpenter-contractor be placed? A policeman? The
proprietor of a radio repair shop? 

5. A study is to be made of farm wages in your state. How would 
you define the unit of study? 

6. Explain and illustrate the meaning of these terms: (a) array, 
(b) range, (c) frequency distribution, (d) class interval, (e) mid-point, 
(f) class limits, (g) grouped data. 

7. What is the effect of tabulation by class intervals on the accuracy
of statistics calculated from a table? Why is this?


References
CHADDOCK, R. E.: Principles and Methods of Statistics, Chaps. IV and V,
Houghton Mifflin Company, Boston, 1925.
Fifteenth Census of the United States, 1930: Population, Vol. II.
GARRETT, H. E.: Statistics in Psychology and Education, pp. 1-8, Longmans,
Green & Company, New York, 1926.
MILLS, F. C.: Statistical Methods, rev. ed., Chap. III, Henry Holt and Company,
Inc., New York, 1938.
MUDGETT, B. D.: Statistical Tables and Graphs, Part I, Chap. III, Houghton
Mifflin Company, Boston, 1930.
SORENSON, H.: Statistics for Students of Psychology and Education, pp. 16-27,
McGraw-Hill Book Company, Inc., New York, 1936.
WALKER, H. M., and W. N. DUROST: Statistical Tables, Their Structure and
Use, Parts I and III, Bureau of Publications, Teachers College, Columbia
University, New York, 1936.
WHITE, R. C.: Social Statistics, Chap. VI, Harper & Brothers, New York,
1933.
YULE, G. U., and M. G. KENDALL: An Introduction to the Theory of Statistics,
Chap. VI, Charles Griffin & Company, Ltd., London, 1937.


CHAPTER VI 
GRAPHS 


1. Graphs of Frequency Distributions.—It is often helpful in 
interpreting a frequency distribution or other statistical data 
to show the facts in graphic form. One method of picturing a 
simple frequency distribution is by means of the histogram.
Table 15 may be represented as shown in Fig. 7.


TABLE 15.—GRADES MADE BY 41 STUDENTS OF STATISTICS, BLANK
COLLEGE, 1939-1940


Grades,     | Students | Accumulated | Accumulated
per cent    |          | frequency   | percentage
            |          |             | frequency*
63-67          2           2             5
68-72          5           7            17
73-77          6          13            32
78-82         10          23            56
83-87         11          34            83
88-92          5          39            95
93-97          2          41           100
Total         41


* Each accumulated frequency is expressed as a percentage of 41, e.g., 7/41 = 17.


In connection with the histogram, it should be noticed that, 
if the class intervals are taken as one unit each, the area of the 


figure is equal to the total frequency of the table. In Fig. 7, 
for example, 


$$\text{area} = 2 \times 1 + 5 \times 1 + 6 \times 1 + 10 \times 1 + 11 \times 1 + 5 \times 1 + 2 \times 1 = 41.$$


A second device for picturing a simple frequency distribution is
the frequency polygon, which is constructed by connecting the
mid-points of the class intervals of the histogram by straight
lines. It is shown in Fig. 8. If it is extended to the base line
at the mid-points of the intervals next beyond the end intervals,
and all equal intervals are taken as one unit in width, its total
area is equal to that of the total frequency of the table, but the
area over any one interval is usually not equal to the frequency
in that interval.
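
Both area statements can be verified with a little arithmetic. The sketch below takes each class interval as one unit wide; the helper `polygon_area_over_class` is an illustrative name, and its formula is simply the sum of the two trapezoids that the polygon forms over the two halves of a class.

```python
# Frequencies of Table 15 (classes 63-67, 68-72, ..., 93-97),
# each class interval being taken as one unit wide.
frequencies = [2, 5, 6, 10, 11, 5, 2]

# Histogram: one rectangle per class, height = frequency, width = 1 unit.
histogram_area = sum(f * 1 for f in frequencies)

# Polygon: heights at the class mid-points, brought down to the base line
# at the mid-points of the empty intervals just beyond each end.
heights = [0] + frequencies + [0]
polygon_area = sum((a + b) / 2 for a, b in zip(heights, heights[1:]))


def polygon_area_over_class(i):
    """Area under the polygon within the limits of class i (0-based).

    Over each half of the class the polygon forms a trapezoid of width 1/2,
    giving (left + 3*centre)/8 + (3*centre + right)/8.
    """
    left, centre, right = heights[i], heights[i + 1], heights[i + 2]
    return (left + 6 * centre + right) / 8


print(histogram_area, polygon_area)   # 41 41.0
print(polygon_area_over_class(0))     # 2.125
```

The first class has a frequency of 2, but the polygon covers an area of 2.125 over it, which illustrates why the area over a single interval usually differs from the frequency even though the totals agree.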


FIG. 7.—Histogram of Table 15.¹
A histogram of the simple frequency distribution of Table 16, 
which has unequal intervals, appears in Fig. 9. 


TABLE 16.—AGE DISTRIBUTION FOR MEXICANS IN THE UNITED
STATES, 1930*


Age, Years             Number (Thousands)
Under 5 e aT, CES 
5-9 .................. 20.55
10-14 ................ 14.81
15-19 ................ 13.72
20-24 ................ 14.65
25-29 ................ 13.53
30-34 ................ 10.11
35-44 ................ 16.30
45-54 ................  9.58
55-64 ................  4.60
65-74 ................  1.96
75 and over ..........  0.88
Total ................ 142.12

* Adapted from Abstract of the Fifteenth Census of the United States, 1930, Bureau of the
Census.


¹ Notice that graphs, such as those of frequency distributions, which
involve two sets of measurements, are erected on the framework of two
graduated straight lines drawn at right angles. The horizontal line is
called the X axis, the perpendicular line the Y axis. Frequencies are
conventionally measured on the Y axis (but see Fig. 13), scale values on the X
axis (see Fig. 7).


FIG. 8.—Frequency polygon of Table 15.


If we let each interval of five years on the base line be one unit,
then, of course, an interval of 10 years will be two units, and the
height of the rectangle in a 10-year interval will be one-half of
the tabular frequency in that interval. The end interval, “75
and over,” in Table 16 is of unspecified length, and so cannot
be accurately represented geometrically. It is accordingly
omitted from the graph, and its frequency removed from the
total. The sum of the areas of the remaining rectangles is then
equal to the corrected total frequency of the table. Moreover,
the area of each rectangle is equal to the frequency in the
corresponding interval.
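
The rule for rectangle heights with unequal intervals can be sketched with three rows of Table 16. The function `rectangle_height` and the constant `UNIT_YEARS` are names assumed for this illustration.

```python
UNIT_YEARS = 5   # one unit on the base line is a five-year interval


def rectangle_height(lower_age, upper_age, frequency):
    """Height that makes the rectangle's area equal to the frequency."""
    width_units = (upper_age - lower_age + 1) / UNIT_YEARS
    return frequency / width_units


# Three rows of Table 16: (lower age, upper age, tabular frequency).
rows = [(30, 34, 10.11), (35, 44, 16.30), (45, 54, 9.58)]

for lower, upper, frequency in rows:
    width_units = (upper - lower + 1) / UNIT_YEARS
    height = rectangle_height(lower, upper, frequency)
    area = height * width_units   # recovers the tabular frequency
    print(f"{lower}-{upper}: width {width_units:g}, "
          f"height {height:.3f}, area {area:.2f}")

# Total after the open-end class "75 and over" is removed.
corrected_total = round(142.12 - 0.88, 2)
print(corrected_total)   # 141.24
```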

If a polygon is drawn on Fig. 9 in the usual way, neither the 
total area of the polygon nor the area in any interval will be 
equal to the corresponding tabular frequency. The total area 
can be made equal to the total frequency, however, if the polygon 
is drawn to the mid-points of five-year intervals throughout, 
using the same frequencies (heights) as in Fig. 9. 


FIG. 9.—Histogram of Table 16, with unequal intervals.


Notice that if the frequencies were known and graphed for 
each year of age, instead of for each five- or 10-year age interval, 
the rectangles of the histogram in Fig. 9 would become more 
numerous and narrow. If then the frequencies in each year 
were separated by months, we should have still more and narrower
rectangles. If this process of subdivision of intervals
were continued indefinitely, we should have a smooth curve
instead of a histogram or a polygon in Fig. 9. It is apparent that
if each minute interval were then regarded as being one unit in
width, the area under any part of the smooth curve would be
equal to the frequency over the same portion of the table (see
Fig. 10). A great deal of use is made of this fact in some of the
chapters that follow.

A polygon may be smoothed by passing through it a freehand 
curve. This is a somewhat questionable way of judging how the 
distribution would appear if the size of the sample were greatly 
increased. 


FIG. 10.—Histogram of Fig. 9 reduced to a smooth curve.

When histograms or polygons are to be compared, they 
Should be graphed in terms of percentage rather than absolute 
frequencies. | р A 

A very useful type of graph in the interpretation of a fre- 
quency distribution is the cumulative curve, or ogive. The 
accumulated frequencies for Table 15, forming a cumulative 
frequency distribution, may be seen in the last column of that 
table. Since each accumulated frequency merely shows the 
total number of values that are less than the lower limit of the 
class just above on the scale, a frequency distribution in cumu- 
lative form is sometimes called a less than frequency distribution. 
Plotting should be done carefully on coordinate paper, in order 
that the resulting graph may be accurate enough for computing 
purposes. The cumulative frequency curve for Table 15 is 
shown in Fig. 11. 


Notice in Fig. 11 that the frequency in each class interval is 
plotted on the upper class limit, to show that a particular number 
of students made a grade less than the one indicated by that 
limit. Thus, 34 students in the given course made grades less 
than 88, and 39 made grades below 93. 

Not only does the cumulative frequency curve give a picture 
of the distribution of frequencies that is different from that 
shown by the histogram or polygon, but it may also be employed 
for interpolation and computation. If, for example, we are 
given Table 15 but know nothing else about the data, and wish 
to change the class limits of the table, we can sometimes do this 


Fig. 11.—"Less-than" cumulative frequency curve, or ogive, for Table 15. (X axis: grades.)


most conveniently by means of the cumulative curve. Suppose 
that we want the class limits of 65 to 69, 70 to 74, 75 to 79, 
and so on. How many students made grades falling in each of 
these new intervals? This can be decided approximately by 
erecting perpendiculars at the points 65, 70, 75, and so on, on 
the base scale, noting where they intersect the cumulative curve, 
and drawing horizontal lines from these points of intersection 
to the frequency scale at the left. Thus, the horizontals cut 
the frequency scale at approximately the values 1, 5, 10, 18, 29, 
37, 40. We can accordingly set up a new frequency table, 
Table 17, whose last column is obtained by subtracting in the 
second column the accumulative frequency in each class from 
that in the class just above. 
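
The subtraction step just described can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the cumulative readings are the approximate values quoted above, with the total of 41 taken from Table 17.

```python
def frequencies_from_cumulative(cumulative):
    """Turn 'less than' cumulative frequencies into per-class frequencies."""
    previous = 0
    result = []
    for value in cumulative:
        # Each class holds the rise of the ogive across that class.
        result.append(value - previous)
        previous = value
    return result


# Approximate ogive readings at 65, 70, ..., 95, followed by the total (41).
cumulative = [1, 5, 10, 18, 29, 37, 40, 41]
students = frequencies_from_cumulative(cumulative)
print(students)  # [1, 4, 5, 8, 11, 8, 3, 1]
```

The class frequencies necessarily add up to the last cumulative value, which is a convenient check on the readings.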

When it is desired simply to halve or combine the class intervals 
of a simple frequency distribution, the work may be done by 
direct division or addition more easily than by the use of a 

TABLE 17.—ILLUSTRATING CHANGE OF CLASS INTERVALS OF TABLE 15, BY 
USE OF CUMULATIVE FREQUENCY CURVE 

Grades, per cent | Accumulative frequency | Students 
60-64            |  1                     |  1 
65-69            |  5                     |  4 
70-74            | 10                     |  5 
75-79            | 18                     |  8 
80-84            | 29                     | 11 
85-89            | 37                     |  8 
90-94            | 40                     |  3 
95-99            | 41                     |  1 
Total            |                        | 41 


cumulative curve. Thus, if in Table 15 the intervals are to be 
halved, the frequencies in each interval are also halved cor- 
respondingly. It is often desirable, however, to modify this 


Fig. 12.—Ogive in terms of percentage frequencies. (X axis: grades; Y axis: cumulative percentage frequency.)

method somewhat by allowing for the shape of the curve. For 
example, if the curve is rising in the interval, more of the fre- 
quencies may be placed in the upper than in the lower subdivision 
of the interval. 


Percentage frequencies are often substituted for absolute 
frequencies on the ogive. Figure 12 is the same as Fig. 11 except 
for this change. From it we read on the Y axis that 50 per cent 
of the students made a grade of less than 82 on the X axis, 
approximately; 75 per cent made less than a grade of [illegible], 
and so on. The readings can be more accurate if finely ruled 
coordinate paper is used. 

Values may be accumulated on both scales, X and Y, and 
expressed as percentages of their respective totals. This has 
been done in Table 18 and Fig. 13. Each pair of accumulative 
percentages determines a point, and they are called the 
coordinates of the point. For example, the first two accumula- 
tive percentages in the table furnish the coordinates (6.6, 0.2), 
the one on the left (6.6) being an X value and the one on the 
right (0.2) a Y value. The point is located on the chart by going 
a distance of 6.6 percentage units from 0 along the X axis, and 
then perpendicularly up a distance of 0.2 percentage units. 


TABLE 18.—NUMBER OF FARMS BY SIZE, KANSAS, 1930* 

Size of farm,  | Number   | Total      | Per cent      | Accumulated per cent 
acres          | of farms | acreage    | Farms | Acres | Farms | Acres 
Under 20       |  11,004  |     86,739 |   6.6 |   0.2 |   6.6 |   0.2 
20-49          |   9,264  |    312,710 |   5.6 |   0.7 |  12.2 |   0.9 
50-99          |  19,226  |  1,475,364 |  11.6 |   3.1 |  23.8 |   4.0 
100-174        |  42,920  |  6,319,557 |  25.8 |  13.5 |  49.6 |  17.5 
175-259        |  25,481  |  5,565,698 |  15.4 |  11.8 |  65.0 |  29.3 
260-499        |  38,385  | 13,796,240 |  23.1 |  29.4 |  88.1 |  58.7 
500-999        |  15,055  | 10,243,252 |   9.1 |  21.8 |  97.2 |  80.5 
1,000-4,999    |   4,487  |  7,184,515 |   2.7 |  15.3 |  99.9 |  95.8 
5,000 and over |     220  |  1,991,572 |   0.1 |   4.2 | 100.0 | 100.0 
Total          | 166,042  | 46,975,647 | 100.0 | 100.0 |       | 

* Adapted from Fifteenth Census of the United States, Bureau of the Census. 


The resulting curve is called the Lorenz curve. From it we can 
see that 50 per cent of the farms, i.e., the small farms (reading 
from the left on the X axis) include 18 per cent of the total farm 
acreage (reading from the bottom on the Y axis); that about 
100 - 65 = 35 per cent of the farms, i.e., the large farms (read- 
ing from the right on the X axis) include 100 - 30 = 70 per cent 
of the total farm acreage (reading from the top on the Y axis); 
and so on. 
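
The coordinates of the Lorenz curve can be computed directly from the counts in Table 18. The following is a minimal sketch; the function name is hypothetical, and because it accumulates the raw counts rather than the rounded percentages, an occasional value may differ from the printed column by about a tenth of one per cent.

```python
def cumulative_percentages(values):
    """Accumulate values and express each running sum as a per cent of the total."""
    total = sum(values)
    running = 0
    result = []
    for value in values:
        running += value
        result.append(100.0 * running / total)
    return result


# Number of farms and total acreage by size class, from Table 18.
farms = [11004, 9264, 19226, 42920, 25481, 38385, 15055, 4487, 220]
acres = [86739, 312710, 1475364, 6319557, 5565698,
         13796240, 10243252, 7184515, 1991572]

x = cumulative_percentages(farms)   # X coordinates: accumulated per cent of farms
y = cumulative_percentages(acres)   # Y coordinates: accumulated per cent of acres

# The curve starts at the origin and ends at (100, 100).
lorenz_points = [(0.0, 0.0)] + list(zip(x, y))
for px, py in lorenz_points:
    print(f"{px:6.1f} {py:6.1f}")
```

If farms were all of equal size, every point would fall on the diagonal from (0, 0) to (100, 100); the sag of the curve below that diagonal shows how unequally the acreage is distributed.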

Further uses of the cumulative curve for computation will be 
shown later, under the topic of partition values (Chap. VIII). 

2. Graphs of Time Series.—Statistical data often take the 
form of a time series, rather than of a frequency distribution. A 
time series is a set of values of a variable that correspond to 
time intervals, such as years or months. For example, 
the populations of a state in 1920 and again in 1930 are a very 
brief time series (see Fig. 14). 

Fig. 13.—Lorenz curve, for Table 18. 

Fig. 14.—Population growth. 

In plotting the increase of one variable, e.g., the population 
of a state, in terms of a second variable, e.g., years, it is often 
of more interest to show the proportionate increase than the 
absolute increase. For example, if a population of 3.0 millions 
increases to 3.5 millions in 10 years, the increase is much less 
impressive than when a population of 0.2 million increases to 
0.7 million in the same period. Yet if the absolute increase is 
plotted, this difference will not appear, as may be seen from Fig. 
14, where the two growth lines are exactly parallel. To meet 
this objection, the percentage increase may be plotted. The 
growth from 3.0 to 3.5 millions is a percentage increase of 17, 
that from 0.2 to 0.7 million is a percentage increase of 250. This 
is shown in Fig. 15, where the line representing the growth 
of the population of 0.2 million is much steeper than that repre- 
senting the growth of the population of 3.0 million. In making 
Fig. 15, the rate of growth in terms of the initial population is 
required. Instead of going to the trouble of computing these 
rates, much the same results may be accomplished by plotting 
the absolute figures on a semilogarithmic scale. The latter 
method is usually preferred to the former, because semilogarithmic 
paper can be obtained at small cost, and the use of it saves much 
labor. 

Fig. 15.—Population growth in terms of percentage increase. (X axis: year, 1920 to 1930.)

Figure 16 shows the above population figures plotted directly 
on semilogarithmic paper. 

In Fig. 16, notice that the increase in population from 0.2 to 
0.7 million is again represented by a much steeper line than is 
the increase from 3.0 to 3.5 millions. 

While the semilogarithmic scale does not show in strictly 
accurate proportion one to another all percentage changes, it 
represents equal percentage changes by equal slopes, and saves 
much labor compared with percentage charts such as that shown 
in Fig. 15. 
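
The contrast among the three ways of plotting can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the "rise" is simply the difference of common logarithms, which is what the vertical distance on semilogarithmic paper represents.

```python
import math


def percentage_increase(start, end):
    """Increase expressed as a per cent of the initial value."""
    return 100.0 * (end - start) / start


def log_rise(start, end):
    """Vertical rise between two values on a base-10 logarithmic scale."""
    return math.log10(end) - math.log10(start)


# The two populations of the text, in millions.
cases = {"3.0 to 3.5 millions": (3.0, 3.5), "0.2 to 0.7 million": (0.2, 0.7)}

for label, (start, end) in cases.items():
    print(label,
          "| absolute:", round(end - start, 1),
          "| per cent:", round(percentage_increase(start, end)),
          "| log rise:", round(log_rise(start, end), 3))
```

The absolute increases are identical, while both the percentage increase and the logarithmic rise are far larger for the small population. Equal percentage changes always give equal logarithmic rises: a doubling from 1 to 2 rises exactly as much as a doubling from 50 to 100.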

In using semilogarithmic paper, the repeated series of values 
1 to 9 usually printed on the vertical scale may be multiplied by 
any constant, provided the constant is applied to the whole scale. 
Thus, in Fig. 16 the scale may be multiplied, say, by 7, by 0.5, 
or by any other number, when thereby it will be made more 
convenient for the plotting of particular data. A semilogarith- 
mic scale cannot contain a zero value. 

Fig. 16.—Population growth plotted on semilogarithmic paper. 

In all graphic representation of data, the shape of the curve 
or figure is affected by the ratio of the X and Y scales. Since 
this ratio is usually a matter of arbitrary choice, advantage is 
sometimes taken of the opportunity to produce misleading 
impressions. Figures 17 through 19 from Table 19 illustrate 
only three of many possibilities. 

TABLE 19.—INCREASE IN ENROLLMENT OF THE BLANK MILITARY ACADEMY, 
1928-1938 

Year | Enrollment 
1928 | 110 
1929 | 113 
1930 | 114 
1931 | 117 
1932 | 118 
1933 | 119 
1934 | 118 
1935 | 120 
1936 | 122 
1937 | 127 
1938 | 130 


In Fig. 17, a rather moderate increase in enrollment is made 
impressive by (1) using a large single-unit spacing on the Y scale, 
(2) starting the increase from the base (bottom) line, and so 
avoiding any comparison between the amount of increase and the 
original volume of enrollment, (3) showing each year's increase as 
a percentage of the enrollment in 1928, instead of as a percentage 


Fig. 17.—Graph of data of Table 19. (X axis: year; Y axis: per cent increase in enrollment.)


of the enrollment of the preceding year. Figure 18 removes 
criticism (2) above, and avoids criticism (3) by using absolute 
enrollment figures instead of percentages of increase relative 
to the total enrollment in 1928. Figure 18 is still open to 
criticism (1) above, because the ratio of the X and Y units is 
not changed. In fact, the X and Y units are different in nature, 
so that it is impossible to say when one bears a just relation to 
the other. 

Figure 19 meets criticism (3) by plotting the enrollment 
figures on a semilogarithmic scale. The total enrollment is not 
entirely pictured in the diagram because the semilogarithmic 
scale begins at 1 instead of at 0, but this is a minor matter. 
Evidently, the growth of the school makes a much poorer showing 
in Fig. 19 than in either Fig. 17 or Fig. 18. Probably Fig. 19 
gives the most realistic picture of the facts in this particular case. 


Fig. 18.—Absolute increase in enrollment of the Blank Military Academy, 1928-1938. (X axis: year; Y axis: enrollment.)


3. Miscellaneous Graphs.—A common device for the graphic 
comparison of amounts or percentages is the bar chart, either 
upright or horizontal. The histogram of Fig. 7 above can be 
regarded as essentially an upright bar chart. Figure 20 shows a 
horizontal bar chart applied to Table 20. 


TABLE 20.—PERCENTAGE OF FEMALES 15-44 YEARS OF AGE MARRIED, 
SELECTED EUROPEAN COUNTRIES* 

Country           | Percentage of females married 
Bulgaria          | 67.0 
England and Wales | 48.5 
France            | [illegible] 
Germany           | 48.4 
Italy             | [illegible] 
Sweden            | [illegible] 

* Adapted from W. S. Thompson, Population Problems, [...] Company, New York, 1935. 

Fig. 19.—Rate of increase in enrollment of the Blank Military Academy, 1928-1938. 

Fig. 20.—Percentages of females 15-44 years of age married in selected European 
countries. (From W. S. Thompson, op. cit., p. 104.) 

Two variations of the bar chart are seen in Figs. 21 and 22. 

Instead of the bar chart, comparisons are often made in terms 
of the areas of squares or circles, or of the volumes of cubes or 
spheres, as in Figs. 23 and 24. These devices, however, force the 
eye to perform the rather difficult feat of measuring two or even 
three dimensions simultaneously. 

The so-called "pie chart," pictured in Fig. 25, is convenient 
for showing how a whole is subdivided. 

Fig. 21.—Percentage of the population of the United States represented by 
each race, 1930. (Adapted from R. Clyde White, Social Statistics, p. 178, Harper 
& Brothers, New York, 1933.) 

Fig. 22.—Percentage of white and Negro races among the commitments 
to prisons and reformatories, 1910 and 1923. (From R. Clyde White, op. cit., 
p. 179.) 

Fig. 23.—Ratio of native born to foreign born in City X: 1930. 

Fig. 24.—Ratio of native born to foreign born in City X: 1930. 

Fig. 25.—World distribution of telephones. (Adapted from G. R. Davies 
and D. Yoder, Business Statistics, p. 40, John Wiley & Sons, Inc., New York.) 

More realistic and striking than any of the above devices 
are pictograms, of which Fig. 26 is an example. 
Maps are treated in many ingenious ways for statistical 
purposes. Crosshatching (see Fig. 27), the insertion of picto- 
grams, and spotting are common devices. 

Fig. 26.—Estimated unemployment, United States, March, 1929, and March, 
1931. (Adapted from On Relief, Federal Emergency Relief Administration, 
Chart IX.) 

Fig. 27.—Percentage of unemployment relief expenditure paid locally, state of 
Wisconsin, year ending Sept. 1, 1934. 

In any attempt to present statistical figures in graphic form, 
the following two principles are to be kept in mind. (1) The 
graph should be more quickly and easily comprehended than the 
same data in tabular or nongraphic form. Graphs are sometimes 
so elaborate or ingenious that they can be deciphered only with 
the aid of the textual and tabular material that they are intended 
to clarify. (2) The graph should not misrepresent or exaggerate 
the facts. 

Exercises 


1. The following figures taken from the Fifteenth Census of the United 
States represent the growth in the population of Milwaukee: 

1930    578,249    115,587 
1920    457,147     71,440 
1910    373,857     45,246 
1900    285,315     20,061 
1890    204,468      1,712 

Present these data graphically. 

2. Suppose that you grade the behavior of a group of juvenile delin- 
quents in a reform school and want to post a weekly chart showing the 
rating of each delinquent. Describe briefly the kind of chart you 
would use. 

3. The charts on page 92 show how the numbers of the insane, epilep- 
tic, and feeble-minded persons in state institutions and the prison 
population in a certain state have increased in the last 25 years. 
Have you any criticism of these charts? 

4. Plot the distribution shown below as a frequency polygon. Show 
that the area under the polygon is equal to the total frequency, but that 
the area in some intervals is not equal to the frequency in those intervals. 

DISTRIBUTION OF 106 EMPLOYEES BY AGE CLASS 

Age of Employee, Years | Employees 
15-24                  | 14 
25-34                  | 49 
35-44                  | 23 
45-54                  | 13 
55-64                  |  7 


5. Rearrange the frequencies of question 4 in class intervals of 15-18, 
19-22, 23-26, and so on. 


6. Plot the data of Problem 4 as an ogive, and read off the age below 
which 75 per cent of the employees fall. 


7. Devise a problem for which a Lorenz curve is suitable, graph the 
curve from your data, and show its use. 


8. Plot the following data in such a way that (a) it gives an unbiased 
picture of the rate of change, (b) it exaggerates the impression of the 
rate of change. 

POPULATION OF MADISON, WIS., 1890-1940* 

Year | Population 
1890 | 13,426 
1900 | 19,164 
1910 | 25,531 
1920 | 38,878 
1930 | 57,899 
1940 | 66,802 

* From the Fifteenth Census of the United States, Bureau of the Census. 

CHAPTER VII 
AVERAGES AND RATES 


1. The Need for an Average.—An investigator is interested, let 
us say, in the height of the residents of a certain Swiss com- 
munity in the United States, on the theory that they are taller 
than their relatives in the old country. Unable to measure 
the whole community of over 4,000 persons, he takes a random 
sample of, say, 182 adult males, and gets their heights as accu- 
rately as possible. He then finds himself with 182 individual 
measurements. What will he do with them? He may perhaps 
first arrange them in order of magnitude, to form an array. 
If no two of the measurements happen to be identical, he will 
still have 182 different measurements. In any case, it will be 
impossible for him to hold in mind all the separate values, and 
he will feel the need of some one figure by which to represent 
them. This need will be still greater when he attempts to 
determine whether or not the American group is taller than a 
similar group in the Old World, because some of the former will 
be taller than some of the latter, and vice versa. In his search 
for a single figure by which to represent the many, he will cer- 
tainly arrive at the idea of calculating an average. 

2. The Mode.—The simplest form of average is the mode (Mo), 
which is merely the value in a series that occurs most often. 
If the heights are all different, there can be no mode in ungrouped 
data. If some persons are of the same height, however, a mode 
may occur in our array. We then choose as the mode the height 
that occurs the greatest number of times. For example, in the 
following array of the heights of 10 European Swiss males, 4 ft. 
11 in., 5 ft. 3 in., 5 ft. 7 in., 5 ft. 8 in., 5 ft. 9 in., 5 ft. 9 in., 5 ft. 
10 in., 5 ft. 11 in., 6 ft. 0 in., the mode is 5 ft. 9 in.; but of course 
the sample is too small to give much information about the 
modal height of European Swiss males in general. Whether or 
not a mode is convincing depends on how conspicuously the 
modal height stands out above the others in frequency of occur- 
rence. If the height 5 ft. 7 in. occurs 10 times and the height 5 ft. 
7.3 in. occurs nine times, it is not certain that one is significantly 
more frequent than the other. 
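
Finding the mode of ungrouped data amounts to counting how often each value occurs. A minimal sketch follows; the function name is hypothetical, and the heights are those listed in the array above, converted to inches.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Every value occurs once, so ungrouped data have no mode.
        return []
    return sorted(value for value, count in counts.items() if count == top)


# 4 ft. 11 in. = 59 in., 5 ft. 3 in. = 63 in., and so on.
heights_in = [59, 63, 67, 68, 69, 69, 70, 71, 72]
print(modes(heights_in))  # [69], i.e., 5 ft. 9 in.
```

The function reports only which value is most frequent. Whether that mode is convincing still depends, as noted above, on how far its frequency stands out from the rest.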
The situation becomes clearer if we decide to overlook slight 
differences in height, and combine our measurements in care- 
fully chosen class intervals, as in Table 21. If the distribution 
is rather regular, we may by inspection then determine whether 
or not any one class interval has a sufficiently larger frequency 
than any other to be confidently regarded as the modal class. If 
so, that is usually all we need to know. In Table 21, col. (1), 
the modal interval is evidently 60 to 64 inches. In col. (2), the 
distribution has two modes, i.e., it is bimodal, suggesting 
that it may contain both males and females. In such cases, it 
often helps to plot the data in the form of a frequency polygon 
or histogram, e.g., Fig. 28. 

Fig. 28.—Histogram of bimodal frequency distribution of Table 21, 
col. (2). (X axis: height in inches.) 

TABLE 21.—HEIGHTS OF 165 AND 182 AMERICAN MALES 
OF SWISS DESCENT 

Height, inches | Males (1) | Males (2) 
45-49          |   2       |   2 
50-54          |  10       |  10 
55-59          |  21       |  55 
60-64          |  55       |  21 
65-69          |  40       |  57 
70-74          |  32       |  32 
75-79          |   5       |   5 
Total          | 165       | 182 


Determination of the exact modal value in grouped data is 
Complex, and cannot be treated here. Several rough methods 
of interpolating within the modal class are available, such as that 
of formula (1). 

$$Mo = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i \qquad (1)$$

where L is the lower limit of the modal class, $\Delta_1$ (the capital 
Greek letter Delta) is the difference 
(disregarding signs) between the frequency of the modal class 
and the frequency of the class just below the modal class on the 
scale, $\Delta_2$ is the difference (disregarding signs) between the 
frequency of the modal class and the frequency of the class just 
above the modal class, and i is the size of the modal class interval. 
Applying this formula to the distribution of Table 21, col. (1), 
we find the crude mode, 

$$Mo = 60 + \frac{(55 - 21)(5)}{(55 - 21) + (55 - 40)}$$

$$Mo = 63.5.$$


Another approximate method of finding the mode of a fre- 
quency distribution is provided by formula (2): 


$$Mo = M - 3(M - Md), \qquad (2)^1$$

where M is the arithmetic mean of the distribution, and Md is the 
median, as described below. Assuming that for the distribution 
of Table 21, col. (1), M = 64.68, and Md = 64.5, formula (2) 
gives for the crude mode: 

$$Mo = 64.68 - 3(64.68 - 64.5),$$

$$Mo = 64.1.$$


This value is a little different from that found by formula (1). 
Mention of the conditions under which formula (2) is appropri- 
ate is made in Sec. 5, below. 
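
Both rough formulas for the crude mode are easy to check by machine. The sketch below is illustrative; the function names are hypothetical, and the inputs are the figures already used in the text for Table 21, col. (1).

```python
def crude_mode_interpolated(lower_limit, f_modal, f_below, f_above, width):
    """Formula (1): interpolate within the modal class."""
    delta_1 = abs(f_modal - f_below)   # difference from the class just below
    delta_2 = abs(f_modal - f_above)   # difference from the class just above
    return lower_limit + delta_1 / (delta_1 + delta_2) * width


def crude_mode_from_mean_median(mean, median):
    """Formula (2): estimate the mode from the mean and the median."""
    return mean - 3 * (mean - median)


# Modal class 60-64 (frequency 55), neighbors 55-59 (21) and 65-69 (40).
mode_1 = crude_mode_interpolated(60, 55, 21, 40, 5)
mode_2 = crude_mode_from_mean_median(64.68, 64.5)
print(round(mode_1, 1), round(mode_2, 1))  # 63.5 64.1
```

The two estimates differ by about half an inch, which is a reminder that both are rough approximations.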


3. The Median.—Suppose that we have the following array 
of the heights of 11 American adult males of Swiss descent: 


TABLE 22.—HEIGHTS OF 11 AMERICAN ADULT MALES OF SWISS DESCENT 

Male | Height, inches 
 1   | 68 
 2   | 68.5 
 3   | 69 
 4   | 69.5 
 5   | 70 
 6   | 71 
 7   | 71.5 
 8   | 72 
 9   | 72.5 
10   | 73 
11   | 74 

1 See derivation of formula (54), Chap. IX. 

A quick way of getting some idea of an average for this series 
would be to note the height that stands at the middle of the 
series. This is seen to be 71 inches, or height number 6 in rank 
order. This kind of average is called the median, which is 
defined as the middle value, or that value which is exceeded by 
as many values as it exceeds. 

Now if a twelfth person of, say, height 75 inches is added to the 
above group, a difficulty arises. There is no middle value. 
Unless we are willing to take the mean height of the sixth and 
seventh persons, (71 + 71.5)/2 = 71.25, as the median, we must 
say that there is none. Although the median so found, 71.25 
inches, is a height that does not actually appear in the series, 
it is customary for most purposes to accept it as the median. 

Consider another common case. Let the height of the fifth 
person in the first group of 11 persons be 71 inches. Again, 
strictly speaking, there can be no median that meets the defini- 
tion, because there are no longer as many heights below the 
middle height as there are above it. As before, a compromise 
is commonly made by taking the middle value (71 inches) as the 
median. 

From the above, it will be noticed that the formula for locating 
the median value in an ungrouped series is to add one to the 
number of values and divide by two: 

$$\frac{N + 1}{2} \qquad (3)$$

Thus, above, where N = 11, the position of the median value is 
(11 + 1)/2 = 6, or the value in position 6; and where N = 12, 
(12 + 1)/2 = 6.5, the median value is the height in position 
6.5, which can only be the mean of the heights in positions 6 and 7. 
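
The rule of formula (3) can be written out as a short routine. This is a minimal sketch with a hypothetical function name; positions are counted from 1, as in the text, and a fractional position is resolved by averaging the two neighboring values.

```python
def median_ungrouped(values):
    """Median of an ungrouped series, located at position (N + 1) / 2."""
    if not values:
        raise ValueError("the series is empty")
    ordered = sorted(values)               # form the array
    position = (len(ordered) + 1) / 2      # formula (3), counting from 1
    lower = ordered[int(position) - 1]
    if position == int(position):
        return lower                       # a true middle value exists
    upper = ordered[int(position)]
    return (lower + upper) / 2             # mean of the two middle values


heights = [68, 68.5, 69, 69.5, 70, 71, 71.5, 72, 72.5, 73, 74]  # Table 22
print(median_ungrouped(heights))          # 71
print(median_ungrouped(heights + [75]))   # 71.25
```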

The above relates to ungrouped data. When the items of a 
series are grouped in class intervals, the median is regarded 
as the value on the X scale that divides the area of the frequency 
histogram or curve into two equal parts, as shown in Fig. 28. 
Thus, in Table 21, col. (2), N/2 = 182/2 = 91. Now, 88 fre- 
quencies fall below the class limit, 65 inches, so that 91 - 88 = 3 
frequencies fall inside the interval 65 to 69. Since there are 57 
frequencies in this interval, and the width of the interval is 
5 inches, the median falls (3/57) × 5 = 0.263 inch inside the interval, 
or at the point 65 + .263 = 65.263 inches on the X scale. The 
area below the median is then 2 + 10 + 55 + 21 + (3/57)(57) = 91, 
that above the median is (54/57)(57) + 32 + 5 = 91, 
and the two areas are equal. 

The simplest way to find the median of grouped data is as 
follows: Accumulate the frequencies, as in the last column of 
Table 23. Divide N by 2: 165/2 = 82.5. Look down the 
column of accumulated frequencies until the frequency in the 
position 82.5 is found, in the interval 60-64. From 82.5 sub- 
tract the accumulated number of frequencies below the median 
interval: 82.5 — 33 = 49.5. Multiply the width of the class 
interval by the fraction 49.5/55, formed by the difference just 
found as numerator and the frequency of the median interval 
as the denominator: 5 × 49.5/55 = 4.5. Add this product to 
the lower limit of the median interval: 60 + 4.5 = 64.5. This 
is the median height for the table. 


TABLE 23.—HEIGHT OF 165 AMERICAN ADULT MALES OF SWISS DESCENT 

Height, inches | Number of males | Accumulated number 
45-49          |   2             |   2 
50-54          |  10             |  12 
55-59          |  21             |  33 
60-64          |  55             |  88 
65-69          |  40             | 128 
70-74          |  32             | 160 
75-79          |   5             | 165 
Total          | 165             | 


We can express the above steps by means of a formula, which 
is applicable to frequency distributions: 


$$Md = L + \left(\frac{\frac{N}{2} - F}{f}\right) i \qquad (4)$$

where L is the lower limit of the class interval in which the 
median falls, F is the number of accumulated frequencies that fall 
below (i.e., in class intervals with limits smaller than those of) the 
median class interval, f is the number of frequencies in the 
median class interval, i is the size of the median class interval, 
and N is the total frequency of the table. N/2 is first found, and 
then the remaining symbols can be evaluated and substituted 
in the formula, as indicated in the preceding paragraph. Thus, 
for the problem above, 

$$Md = 60 + \left(\frac{82.5 - 33}{55}\right) 5 = 64.5,$$

as before. 
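
Formula (4) translates directly into a short routine that accumulates the frequencies until the median class is reached. This is a sketch under the assumption that all class intervals have the same width; the function name is hypothetical.

```python
def median_grouped(lower_limits, frequencies, width):
    """Median of a frequency distribution by formula (4)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("the distribution is empty")
    half = n / 2
    below = 0  # F: frequencies accumulated below the current class
    for lower, f in zip(lower_limits, frequencies):
        if below + f >= half:
            # This is the median class: interpolate within it.
            return lower + (half - below) / f * width
        below += f
    raise ValueError("median class not found")


lower_limits = [45, 50, 55, 60, 65, 70, 75]
col_1 = [2, 10, 21, 55, 40, 32, 5]   # Table 21, col. (1)
col_2 = [2, 10, 55, 21, 57, 32, 5]   # Table 21, col. (2)

print(round(median_grouped(lower_limits, col_1, 5), 3))  # 64.5
print(round(median_grouped(lower_limits, col_2, 5), 3))  # 65.263
```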

4. The Arithmetic Mean.—The arithmetic mean, M, is the 
type of average that is most often used. It is the sum of the 
X values divided by their number, N: 

$$M = \frac{\Sigma X}{N} \qquad (5)^*$$

For example, in the case of the ungrouped values, 3, 7, 2, 12, 
1, 16, 4, representing the numbers of children in seven Italian 
immigrant families, their sum is 45, and there are seven of them, 
so that M = 45/7 = 6.43. 

If some of the above values had occurred more than once, we 
might have 

X: 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 7, 7, 7, 12, 12, 16, 16 

$$M = \frac{111}{18} = 6.17$$

But, as shown in Chap. V, this long array may be condensed: 

* The Greek letter Σ, capital Sigma, means to sum, or add, the X values. 

TABLE 24.—NUMBER OF CHILDREN IN 18 ITALIAN IMMIGRANT FAMILIES 

Children (X) | Families (f)* | fX† 
 1           |  1            |   1 
 2           |  2            |   4 
 3           |  3            |   9 
 4           |  5            |  20 
 7           |  3            |  21 
12           |  2            |  24 
16           |  2            |  32 
Total        | 18            | 111 

* Frequency. 
† Frequency multiplied by X. 


In the case of grouped data, it is more convenient to write 
formula (5) in the form 

$$M = \frac{\Sigma fX}{N}, \qquad (6)$$

where f is the frequency. 
Substituting in formula (6) N = 18 and the total of the third 
column of the table just above, 

$$M = \frac{111}{18} = 6.17,$$

as before. 
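
Formulas (5) and (6) can be compared side by side in a few lines. The function names below are hypothetical; the data are the family sizes of the text and the condensed form of Table 24.

```python
def mean(values):
    """Formula (5): sum of the values divided by their number."""
    return sum(values) / len(values)


def mean_from_frequencies(xs, fs):
    """Formula (6): sum of f times X, divided by N, the sum of the frequencies."""
    return sum(f * x for x, f in zip(xs, fs)) / sum(fs)


children = [3, 7, 2, 12, 1, 16, 4]              # seven families
print(round(mean(children), 2))                  # 6.43

xs = [1, 2, 3, 4, 7, 12, 16]                     # Table 24, children (X)
fs = [1, 2, 3, 5, 3, 2, 2]                       # Table 24, families (f)
print(round(mean_from_frequencies(xs, fs), 2))   # 6.17
```

Writing out the long array of 18 values and applying formula (5) gives the same result as formula (6), since multiplying by f merely replaces repeated addition.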

Formula (6) may be applied to any frequency distribution, e.g., 
that of Table 25. 


TABLE 25.—HEIGHT OF 165 AMERICAN ADULT MALES OF SWISS DESCENT 

Height, inches | X*   | Adult males, f | fX 
45-49          | 47.5 |   2            |     95.0 
50-54          | 52.5 |  10            |    525.0 
55-59          | 57.5 |  21            |  1,207.5 
60-64          | 62.5 |  55            |  3,437.5 
65-69          | 67.5 |  40            |  2,700.0 
70-74          | 72.5 |  32            |  2,320.0 
75-79          | 77.5 |   5            |    387.5 
Total          |      | 165            | 10,672.5 

* Mid-points. 

$$M = \frac{10,672.5}{165} = 64.68.$$

As pointed out earlier, the mean calculated from a frequency 
table in which the mid-points are not identical with the means 
within the intervals is, of course, somewhat inaccurate, as is 
any other average or statistic found from such a table. 

It is possible to simplify the calculations needed to find the 
arithmetic mean (usually called simply the mean) in a frequency 
distribution such as that of Table 25. Suppose that the mid- 
points of any distribution are $X_1$, $X_2$, $X_3$, etc. They can be 
represented by the diagram of Fig. 29, where $f_1$ is the frequency 
in the $X_1$ interval, etc., measured along the Y axis. 

Fig. 29.—Diagram used in derivation of formula (13). 

By formula (6), 

$$M = \frac{\Sigma fX}{N}.$$

But suppose that we choose to measure the X values from some 
arbitrarily assumed or "guessed" point on the X axis, say A, 
in Fig. 29. Then 

$$X_1 = A + x'_1, \quad X_2 = A + x'_2, \ \text{etc.,} \qquad (7)$$

where the $x'$'s represent the distances of the X's measured from 
A. If we further choose to reduce the size of the $x'$'s by dividing 
them by the size of the class interval, i, or other constant, we have 

$$\frac{x'_1}{i} = d_1, \quad \frac{x'_2}{i} = d_2, \ \text{etc.,} \qquad (8)$$

or 

$$x'_1 = d_1 i, \quad x'_2 = d_2 i, \ \text{etc.} \qquad (9)$$

Substituting the values of $x'_1$ and $x'_2$ from (9) in (7), 

$$X_1 = A + d_1 i, \quad X_2 = A + d_2 i, \ \text{etc.} \qquad (10)$$

Substituting from (10) in (6), 

$$M = \frac{\Sigma f(A + di)}{N} \qquad (11)$$

$$M = \frac{\Sigma fA + \Sigma fdi}{N}.$$

Constants can always be placed before the summation sign,¹ so 
that 

$$M = \frac{A\,\Sigma f}{N} + \frac{i\,\Sigma fd}{N}.$$

Now 

$$\Sigma f = N.$$

So that 

$$M = \frac{AN}{N} + \frac{i\,\Sigma fd}{N},$$

or 

$$M = A + i\left(\frac{\Sigma fd}{N}\right). \qquad (13)$$

Let us apply formula (13) to find the mean of the frequency 
distribution of Table 26: 


TABLE 26.—HEIGHT OF 165 AMERICAN ADULT MALES OF SWISS DESCENT 

* Mid-points. 

In the above table, by arbitrary choice, the assumed mean is 

$$A = 62.5, \quad i = 5, \quad \Sigma fd = 72, \quad N = 165.$$

1 Notice the principle that $\Sigma(X + Y) = \Sigma X + \Sigma Y$. (12) 

Therefore, substituting in formula (13), 

$$M = 62.5 + \frac{5(72)}{165},$$

$$M = 62.5 + 5(.436),$$

$$M = 62.5 + 2.18,$$

$$M = 64.68,$$

which is the same as we found the mean to be by the “long” 
method. The calculations required in Table 26 are greatly 
reduced compared with those in Table 25. 
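
The short method of formula (13) may be sketched as follows. The function name is hypothetical; the mid-points and frequencies are those of Table 25, and the deviations d are computed here rather than read from a table.

```python
def mean_short_method(midpoints, frequencies, assumed_mean, interval):
    """Formula (13): M = A + i * (sum of f*d) / N, with d = (X - A) / i."""
    n = sum(frequencies)
    sum_fd = sum(f * (x - assumed_mean) / interval
                 for x, f in zip(midpoints, frequencies))
    return assumed_mean + interval * sum_fd / n, sum_fd


midpoints = [47.5, 52.5, 57.5, 62.5, 67.5, 72.5, 77.5]
frequencies = [2, 10, 21, 55, 40, 32, 5]

m, sum_fd = mean_short_method(midpoints, frequencies, 62.5, 5)
print(sum_fd, round(m, 2))   # 72.0 64.68

# A different assumed mean changes the sum of f*d but not the result.
m_alt, sum_fd_alt = mean_short_method(midpoints, frequencies, 67.5, 5)
print(sum_fd_alt, round(m_alt, 2))   # -93.0 64.68
```

Repeating the work from a second assumed mean, as in the last two lines, is a simple check on the arithmetic.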

The second column of Table 26 is inserted for explanatory 
purposes only, and is omitted except when irregular class intervals 
cause difficulties. Table 27 illustrates the usual form for 
computation. 


TABLE 27.—NUMBER OF RELIEF CASES PER BLOCK IN A SLUM AREA OF A 
CITY 

Notice that it makes no difference in the result where A, 
i.e., the assumed mean, is placed. A good way to check 
the work is to perform the calculations from two different 
assumed means. 

Formula (13) also holds for irregular class intervals if i, 
the interval, is held constant. This may be illustrated below. 
Consider this table of age distribution for Wisconsin, from the 
1930 census. 

TABLE 28.—AGE DISTRIBUTION OF POPULATION, WISCONSIN, 1930 

Age         | Per cent, f | X*   | x' = X - A | d = x'/i | fd     | Accumulated f† 
0-4         |  9.2        |  2.5 | -20.0      | -4.0     |  -36.8 |  9.2 
5-9         |  9.9        |  7.5 | -15.0      | -3.0     |  -29.7 | 19.1 
10-14       |  9.7        | 12.5 | -10.0      | -2.0     |  -19.4 | 28.8 
15-19       |  9.2        | 17.5 |  -5.0      | -1.0     |   -9.2 | 38.0 
20-24       |  8.3        | 22.5 |   0        |  0       |    0   | 46.3 
25-29       |  7.7        | 27.5 |  +5.0      | +1.0     |   +7.7 | 54.0 
30-34       |  7.4        | 32.5 | +10.0      | +2.0     |  +14.8 | 61.4 
35-44       | 14.0        | 40.0 | +17.5      | +3.5     |  +49.0 | 75.4 
45-54       | 10.6        | 50.0 | +27.5      | +5.5     |  +58.3 | 86.0 
55-64       |  7.2        | 60.0 | +37.5      | +7.5     |  +54.0 | 93.2 
65-74       |  4.6        | 70.0 | +47.5      | +9.5     |  +43.7 | 97.8 
75 and over |  2.0        |  ?   |   ?        |  ?       |    ?   | 99.8 
Total       | 99.8        |      |            |          | +132.4 | 

* Mid-points. The census records age in whole years, as of the last birthday. But since 
the actual ages are not discrete, age should be treated as continuous. Otherwise, all aver- 
ages will be too low. 

† Accumulated frequencies. 


Finding the mean for the table below age 75¹ by substituting 
in formula (13), 

$$M = 22.5 + 5\left(\frac{132.4}{97.8}\right) = 22.5 + 5(1.354)$$

$$M = 29.27$$


In such a table as this, with an open interval, only the median 
and mode can be found for the total table. Why? What is 
the median value for Table 28? 

5. Interpretation of the Common Averages.—The arithmetic 
mean, M, is the most familiar type of average. It is amenable 
to algebraic operations which cannot be applied to the median 
or mode. Suppose we know that the mean of one distribution of 
50 items is 4, and the mean of a second comparable distribu- 
tion of 75 items is 6. Then the mean of both distributions is 


$$\frac{(4 \times 50) + (6 \times 75)}{50 + 75} = 5.2.$$

The only accurate way of finding
the median of the total distribution is actually to combine the
distributions, interval by interval, and recompute the median
and mode for the combined distribution, just as was done for the
separate distributions. If there are several medians given, it is
possible to find the median median, but it is not likely to be the
same as the median of the combined distributions. Although
the mean of two or more medians is sometimes used, the meaning
of such a combination of averages is not clear. “A correct total
cannot be obtained by multiplying the median by the number of
items” in a distribution.²

1 The mean, of course, cannot be found for the table including the open
interval, “75 and over,” because no mid-point can be assigned to an open
interval.
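
The contrast between the two kinds of average can be sketched in Python. The combined mean uses the figures from the text (means of 4 and 6 for 50 and 75 items); the two short series used for the medians are hypothetical and serve only to show that the median of several medians need not equal the median of the pooled items.

```python
from statistics import median


def combined_mean(means, counts):
    """Weight each mean by its number of items, then divide by the total."""
    total = sum(m * n for m, n in zip(means, counts))
    return total / sum(counts)


# Figures from the text: 50 items with mean 4, 75 items with mean 6.
combined = combined_mean([4, 6], [50, 75])

# Hypothetical series, for illustration only.
series_a = [1, 2, 3]
series_b = [4, 5, 6, 100, 200]
median_of_medians = median([median(series_a), median(series_b)])
pooled_median = median(series_a + series_b)

print(combined, median_of_medians, pooled_median)
```

No comparable shortcut exists for the median: the items themselves must be pooled before it can be found.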

A second characteristic of the mean is that it alone of the 
three averages reflects the exact value of every item. If extreme 
values occur in a series, they affect the mean much more than 
the median or the mode, because the median is affected only 
by the circumstance that an item is greater or smaller than the 
median item (the amount of the difference being of no consequence),
and the mode is affected only by whether or not the
size of a value throws it into one class interval or another. Consider
the series of ages in years, 2, 4, 7, 10, 13, 15, 19. M = 10,
Md = 10. If the three items that are larger than 10 are replaced by
three others also larger than 10, the Md stays the same, but the
M changes. Thus for 2, 4, 7, 10, 58, 70, 80, M = 33, Md = 10.
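
A minimal Python sketch of this comparison, using the two series of ages given above and the standard `statistics` module:

```python
from statistics import mean, median

ages = [2, 4, 7, 10, 13, 15, 19]
ages_extreme = [2, 4, 7, 10, 58, 70, 80]  # three largest items replaced

# The median is unchanged; the mean follows the extreme values.
print(mean(ages), median(ages))
print(mean(ages_extreme), median(ages_extreme))
```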

This is sometimes an advantage of the mean, and sometimes a
disadvantage. If the extreme values are regarded as atypical
of the series, the median will be a better average than the mean,
because the median is less influenced by such values. If, on
the other hand, the extreme values are thought to be an integral
part of the series and to deserve full weight, then the mean is
more appropriate than the median. In series where the mean
seems inappropriate, it is often advisable to question the
representativeness of any average, and to drop the atypical items.

A third important trait of the mean is that it usually changes
less than the other two averages, from sample to sample.
Suppose that the I.Q.'s of the first 100 students met on a college
campus are taken and the M, Md, and Mo of these I.Q.'s are
computed. The same thing is done with a second hundred students,
a third, and so on. Then the differences between the means
of the several samples will generally be less than the differences
between the medians or the modes. This sampling stability of
the mean is very much in its favor.

2 Willford I. King, The Elements of Statistical Method, p. 131, The
Macmillan Company, New York, 1918.

For such reasons as the above, in the averaging of measurements
the arithmetic mean is always to be preferred to the
median or mode unless it is felt to be much less representative
of the series than they are, or unless, because of open-end class
intervals, the mean cannot be calculated. When we have to deal
with a series of ranked items, rather than measured values, however,
only the median applies.

A frequency distribution is exactly balanced along the perpendicular
erected at the mean. The sum of the deviations of a series of
values from their mean with regard for signs, i.e., the algebraic sum,
is always zero. This is not true of the other averages except in
perfectly symmetrical distributions, where the mean, median, and
mode all coincide (see Fig. 30). A distribution is symmetrical when
equal frequencies occur at equal distances above and below the mean,
as in Table 29 and Fig. 30.

Fig. 30.—Graph of symmetrical frequency distribution of Table 29.
(Horizontal axis: number of heads.)


TABLE 29.—DISTRIBUTION OF EXPECTED NUMBER OF HEADS FROM 1,024
TOSSES OF 10 COINS EACH

Heads among 10 Coins    Tosses of 10 Coins
         0                       1
         1                      10
         2                      45
         3                     120
         4                     210
         5                     252
         6                     210
         7                     120
         8                      45
         9                      10
        10                       1
       Total                 1,024

On the other hand, when signs are disregarded, the sum of the 
deviations is least in the case of the median. 
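
Both properties can be verified with a brief Python sketch. It assumes that the expected frequencies of Table 29 are the binomial coefficients for 10 fair coins (which sum to 1,024), and it reuses the skewed series 2, 4, 7, 10, 58, 70, 80 from the earlier discussion, whose mean is 33 and median 10; the names are illustrative.

```python
from math import comb

# Expected tosses showing k heads among 1,024 tosses of 10 fair coins.
heads = list(range(11))
tosses = [comb(10, k) for k in heads]
n = sum(tosses)
mean_heads = sum(k * f for k, f in zip(heads, tosses)) / n

# Algebraic (signed) sum of deviations from the mean.
signed_sum = sum(f * (k - mean_heads) for k, f in zip(heads, tosses))


def abs_deviation_sum(values, center):
    """Sum of deviations from `center`, signs disregarded."""
    return sum(abs(v - center) for v in values)


ages = [2, 4, 7, 10, 58, 70, 80]
from_mean = abs_deviation_sum(ages, 33)
from_median = abs_deviation_sum(ages, 10)

print(n, mean_heads, signed_sum)
print(from_mean, from_median)
```

The signed deviations cancel about the mean, while the unsigned total is smaller about the median than about the mean.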
Typically, in distributions that are not symmetrical or bell-shaped,
but skewed, i.e., extending farther on one side than on
the other, the mean is pulled farthest in the direction of the 
skewness (because of its sensitiveness to extreme values), the 
mode is nearest the end of the scale opposite the direction of 
the skewness, and the median falls somewhere in between the 
other two (see Fig. 31). Indeed, in moderately skewed dis- 
tributions, the median is generally about one-third of the dis- 
tance from the mean to the mode, a fact utilized in formula (2) 
above. If the three averages are calculated for the skew dis- 
tribution of Table 28 below age 75, using formula (2) for the mode, 
they will be found to fall in this way (M = 29.27; Md = 27.34; 
Mo = 23.48). 
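
Formula (2) is not reproduced in this section. Assuming it is the usual empirical relation Mo = M - 3(M - Md), which follows from placing the median one-third of the way from the mean to the mode, the quoted mode can be checked as follows (the function name is illustrative):

```python
def empirical_mode(mean, median):
    """Mode estimated from the mean and median of a moderately skewed series."""
    return mean - 3 * (mean - median)


# Mean and median of Table 28 below age 75, as quoted in the text.
mode = empirical_mode(29.27, 27.34)
print(round(mode, 2))
```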


Fig. 31.—Skewed frequency distributions. (In one panel the averages
fall in the order Mo, Md, M; in the other, M, Md, Mo.)

The usefulness of any average usually depends upon how
representative it is of its distribution or series, i.e., upon what
proportion of the items in the series is close to the average.
Although it is mathematically possible to calculate the mean,
median, or mode for any series, the concept of the average as a
value representative of the series has much more validity in the
case of some series than of others. It is most valid for symmetrical
distributions, and least valid for distributions shaped like
the letter J (or reversed J), or the letter U, illustrated in Table
30, cols. (2) and (3), respectively, and Figs. 33, 34. In the case of
J- and U-shaped distributions, any average is likely to conceal more
important information than it reveals, and for this reason it is
usually advisable not to compute averages for distributions of
such extreme types. Perhaps the mode is the best of the
three averages in situations of this kind; but even its value is
questionable.

TABLE 30.—AGE DISTRIBUTIONS (HYPOTHETICAL DATA)

Years of age    (1) f    (2) f    (3) f
 0- 4.9           21       18      116
 5- 9.9           53       21       53
10-14.9          116       47       18
15-19.9           47       53    (illegible)
20-24.9           18      116    (illegible)

Enlarging on the last point, special precaution is necessary to
avoid the use of averages to represent a group that varies widely
within itself. Thus a single infant mortality rate for a county
containing a large city and a rural area in which the rates are
very different is likely to be not only meaningless, but misleading.
This point must be kept constantly in mind in most statistical
problems, e.g., the calculation of a correlation coefficient. The
latter, which is an average, may indicate a moderate amount of
relationship over the whole table, whereas actually there is no
relationship at one end of the table and a close relationship at
the other (see Chap. X).

Fig. 32.—Graph of roughly symmetrical frequency distribution of Table 30,
Col. (1).
Fig. 33.—Graph of J-shaped frequency distribution of Table 30, Col. (2).
Fig. 34.—Graph of U-shaped frequency distribution of Table 30, Col. (3).
(Horizontal axis in each figure: X = years of age.)

It should be noticed that an average, usually the mean, may 
sometimes legitimately be used for the purpose of resolving a
series of values into a single composite value, whether the latter is 
“representative” of the values in the series or not. This is the 
case when the chief interest lies merely in comparing the com- 
posite values of two or more series, as the mean size of income of 
all workers with the mean size of income of unskilled laborers 
alone. 

In most cases, it is important to exhibit the table of the fre- 
quency distribution as a whole, so that the distribution of the 
items, as well as their averages, may be known to the reader. 
It is also a practice in doubtful cases to present all three averages
side by side, so that their differences may be seen. This, however,
may merely throw upon the reader the responsibility of
choosing an average.
6. The Geometric Mean.—In averaging a series of numbers
that bear an approximately constant ratio to one another, like
2, 4, 8, 16, none of the three averages described above is as
appropriate as the geometric mean. The geometric mean is
used to average any series in which changes are expressed as
ratios rather than as absolute differences. It is also preferable
in averaging some skewed distributions, since it gives less
weight to extreme variations than does the arithmetic mean.
The geometric mean is always smaller than the corresponding
arithmetic mean, except when all the values are equal, in which
case the two coincide. When a series contains a zero or negative
value, its geometric mean cannot be found. Just as the sum
of the plus deviations is equal to the sum of the minus deviations
from the arithmetic mean, so the product of the ratios
of the values smaller than the geometric mean to the geometric
mean is equal to the product of the ratios of the geometric mean
to the values larger than the geometric mean (e.g., the geometric
mean of 5, 8, 10, and 12 is 8.3, and
$\frac{5}{8.3} \times \frac{8}{8.3} = \frac{8.3}{10} \times \frac{8.3}{12}$,
apart from rounding). Also, corresponding to the fact that when each
member of a series is replaced by the arithmetic mean of the series
the sum of the series is not changed (e.g., 3 + 7 + 5 = 15, and
5 + 5 + 5 = 15), so when each member of a series is replaced by the
geometric mean the product remains the same (e.g., 12 × 34 × 4 = 1,632,
and 11.7735 × 11.7735 × 11.7735 = 1,632).

For an ungrouped series of values $X_1, X_2, \ldots, X_n$ the formula
for the geometric mean is

$$G = \sqrt[n]{X_1 X_2 \cdots X_n} \tag{14}$$

For grouped data,

$$G = \sqrt[N]{X_1^{f_1} X_2^{f_2} \cdots X_k^{f_k}} \tag{15}$$

where $X_1$ is a mid-point and $f_1$, its exponent,¹ is the corresponding
frequency. Computation, however, is most conveniently
done by means of logarithms, using the respective formulas:

$$\log G = \frac{1}{n} \sum_{i=1}^{n} \log X_i \tag{16}$$

$$\log G = \frac{1}{N} \sum f_k \log X_k, \tag{17}$$

where $N = \sum f_k$.

1 Exponent means the power to which X is raised, e.g., $(X)^2$. Here the
exponent is 2, the second power.
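
The logarithmic route of formula (16) and the two properties stated in the text can be sketched in Python. The helper name `geometric_mean` is illustrative, and natural logarithms are used in place of a table of common logarithms, which does not affect the result.

```python
import math


def geometric_mean(values):
    """Unweighted geometric mean via logarithms, as in formula (16)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


g4 = geometric_mean([5, 8, 10, 12])
g3 = geometric_mean([12, 34, 4])

# Ratios of the smaller values to G balance ratios of G to the larger values.
below = (5 / g4) * (8 / g4)
above = (g4 / 10) * (g4 / 12)

print(round(g4, 1), round(g3, 4), round(g3 ** 3))
```

Replacing each of 12, 34, and 4 by their geometric mean leaves the product at 1,632, as the text states.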


To illustrate the use of formula (16), the geometric mean of
the rates in col. (4) of Table 32 below is found from a table of
logarithms:¹

$$\begin{aligned}
\log G &= \tfrac{1}{7}(\log 0.015 + \log 0.058 + \log 0.061 + \log 0.047 \\
&\qquad + \log 0.029 + \log 0.011 + \log 0.001) \\
&= \tfrac{1}{7}(8.17609 - 10 + 8.76343 - 10 + 8.78533 - 10 \\
&\qquad + 8.67210 - 10 + 8.46240 - 10 + 8.04139 - 10 + 7.00000 - 10) \\
&= \tfrac{1}{7}(57.90074 - 70) \\
&= 8.27153 - 10
\end{aligned}$$

$$G = 0.019$$

Notice that the geometric mean obtained by formula (16) is unweighted,
i.e., each rate is given equal weight. The unweighted
arithmetic mean of the same rates is 0.03171, while the weighted 
arithmetic mean rate, from cols. (2) and (3) of Table 32, is 
26,326/790,193 = 0.03331. 
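
The three averages just compared can be reproduced with a short Python sketch. The rates and the two totals are those quoted in the text; the variable names are illustrative.

```python
import math

# Rates from col. (4) of Table 32.
rates = [0.015, 0.058, 0.061, 0.047, 0.029, 0.011, 0.001]

# Unweighted geometric mean by common logarithms, formula (16).
log_g = sum(math.log10(r) for r in rates) / len(rates)
g = 10 ** log_g

# Unweighted arithmetic mean of the same rates.
unweighted_mean = sum(rates) / len(rates)

# Weighted arithmetic mean: total female births over total females.
weighted_mean = 26326 / 790193

print(round(g, 3), round(unweighted_mean, 5), round(weighted_mean, 5))
```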
The total column of the table in Exercise 1 below shows a 
skewed distribution, so that the geometric mean should be more 
representative of it than the arithmetic mean. By formula (17), 
$$\begin{aligned}
\log G &= \tfrac{1}{391}(73 \log 2 + 96 \log 6 + 101 \log 10 + 48 \log 14 \\
&\qquad + 52 \log 18 + 21 \log 22) \\
&= \tfrac{1}{391}[73(0.30103) + 96(0.77815) + 101(1) + 48(1.14613) \\
&\qquad + 52(1.25527) + 21(1.34242)] \\
&= \tfrac{1}{391}(21.97519 + 74.70240 + 101.00000 + 55.01424 \\
&\qquad + 65.27404 + 28.19082) = 0.88531
\end{aligned}$$

$$G = 7.68$$
The arithmetic mean is 9.72 and the median is 9.05. Formula 
(17) gives a weighted geometric mean, or geometric mean of a 
frequency distribution. 
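
A sketch of formula (17) in Python, applied to the mid-points and total frequencies used in the computation above; the function name is illustrative.

```python
import math


def weighted_geometric_mean(midpoints, freqs):
    """Geometric mean of a frequency distribution, as in formula (17)."""
    n = sum(freqs)
    log_g = sum(f * math.log10(x) for x, f in zip(midpoints, freqs)) / n
    return 10 ** log_g


midpoints = [2, 6, 10, 14, 18, 22]
freqs = [73, 96, 101, 48, 52, 21]

g = weighted_geometric_mean(midpoints, freqs)
arithmetic = sum(x * f for x, f in zip(midpoints, freqs)) / sum(freqs)
print(round(g, 2), round(arithmetic, 2))
```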


1 See Appendix, Table 7, and accompanying explanation. 
Notice the application of the geometric mean to the problem
of estimating the population midway between two decennial
censuses. In Table 31 the population of the United States in
millions is shown at 10-year intervals from 1790 to 1940.

TABLE 31.—POPULATION OF THE UNITED STATES, 1790-1940
(In millions)

Year    Population    Year    Population
1790       3.93       1870      38.56
1800       5.31       1880      50.16
1810       7.24       1890      62.95
1820       9.64       1900      75.99
1830      12.87       1910      91.97
1840      17.09       1920     105.71
1850      23.19       1930     122.78
1860      31.44       1940     131.41 (prelim.)

When these figures are plotted, we get the absolute growth curve
shown in Fig. 35.

Fig. 35.—Absolute growth of population, United States, 1790-1940.
(Vertical axis: population in millions.)

Now suppose it is wanted to estimate the
population in 1795, midway between the censuses of 1790 and
1800. If we take the arithmetic mean of the populations at
1790 and at 1800, we have

$$\frac{5.31 + 3.93}{2} = 4.62 \text{ millions.}$$
This evidently assumes that the absolute amount of population
increase is the same over equal periods of time, since 


4.62 — 3.93 = 0.69, 


and 5.31 — 4.62 = 0.69. From Table 31, however, we see 
that the differences, 5.31 — 3.93 = 1.38 and 7.24 — 5.31 = 1.93, 
are not equal, and this is borne out by inspection of Fig. 35. 
Actually, around the dates 1790 and 1800, as far as we can judge 
from the given data, the absolute growth in population was 
increasing. Under these conditions, the growth curve between 
1790 and 1800 would probably be concave, as shown by line a 
in Fig. 36. The population in 1795 would then be somewhat 
less than that found by the method of the arithmetic mean,
which implies a straight line rather than a concave trend (line b
in Fig. 36). On the simple assumption that the rate of annual
increase was constant between 1790 and 1800, the growth curve
will be concave, and the geometric mean will give the exact
population in 1795. The geometric mean is, therefore, usually
regarded as the logical average to use when the growth curve is
concave. The formula may be written

$$P = \sqrt{P_0 P_1} = (P_0 P_1)^{1/2}, \tag{18}$$

where $P$ is the population midway between the two censuses, $P_0$
is the population at the first census, and $P_1$ is the population at
the second census. Substituting in this formula,

$$P = \sqrt{3.93(5.31)} = 4.57 \text{ millions.}$$

Fig. 36.—Probable trend of population growth in the United States, 1790-1800.
(Time axis: 1790, 1795, 1800; the arithmetic-mean and geometric-mean
estimates are marked at 1795.)

The geometric mean is not a suitable average, however, when 
the absolute amount of change is less each decade, as happened 
between 1930 and 1940. The growth curve is then convex (like 
c, Fig. 36), so that both the arithmetic and the geometric means 
give too low estimates of the population midway between 
censuses. 
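
The two mid-decade estimates for 1795 can be compared in a few lines of Python, using the 1790 and 1800 populations from Table 31; the variable names are illustrative.

```python
import math

p_1790 = 3.93  # millions
p_1800 = 5.31  # millions

# Arithmetic mean: assumes equal absolute increase in each half-decade.
arithmetic_estimate = (p_1790 + p_1800) / 2

# Geometric mean, formula (18): assumes a constant rate of increase.
geometric_estimate = math.sqrt(p_1790 * p_1800)

print(round(arithmetic_estimate, 2), round(geometric_estimate, 2))
```

The geometric estimate lies below the arithmetic one, as expected for a concave growth curve.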


If we wish to calculate the constant annual rate of population
increase that was assumed in finding the geometric mean (= 4.57)
above, we apply the formula

$$r = \left(\frac{P_n}{P_0}\right)^{1/n} - 1. \tag{19}$$

If, as before, $P_0 = 3.93$, $P_n = 5.31$, and $n = 10$, we have

$$r = \left(\frac{5.31}{3.93}\right)^{1/10} - 1 = (1.35)^{1/10} - 1.$$

By logarithms,

$$\log (1.35)^{1/10} = \tfrac{1}{10} \log 1.35 = \tfrac{1}{10}(0.13033) = 0.013033,$$

so

$$(1.35)^{1/10} = 1.03, \qquad r = 1.03 - 1.00 = 0.03.$$

That is, in finding the geometric mean we assumed that the
population increased at the average rate of about 3 per cent per
year between 1790 and 1800.

For the same problem, the arithmetic mean gives a rate of

$$\frac{5.31 - 3.93}{3.93(10)} = 3.5 \text{ per cent,}$$

which if assumed to be constant over the 10-year period would
result in a population in 1800 of*

$$P_{10} = P_0(1 + r)^n. \tag{20}$$

$$P_{10} = 3.93(1.035)^{10}.$$

$$\log (1.035)^{10} = 10 \log 1.035 = 10(0.01494) = 0.14940,$$

and $(1.035)^{10} = 1.411$, or $P_{10} = 3.93(1.411)$,
$P_{10} = 5.55$ millions, whereas, actually, $P_{10} = 5.31$ millions.
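
Formulas (19) and (20) can be sketched in Python as follows. The function names are illustrative, and the arithmetic rate is carried unrounded, so the projected figure may differ from a hand computation in the last digit.

```python
def annual_rate(p0, pn, years):
    """Constant annual rate of increase, formula (19)."""
    return (pn / p0) ** (1 / years) - 1


def project(p0, rate, years):
    """Population after `years` at a constant annual rate, formula (20)."""
    return p0 * (1 + rate) ** years


p0, p10 = 3.93, 5.31  # millions, 1790 and 1800

r_geometric = annual_rate(p0, p10, 10)
r_arithmetic = (p10 - p0) / (p0 * 10)

# Compounding the arithmetic rate overshoots the actual 1800 population.
overshoot = project(p0, r_arithmetic, 10)
recovered = project(p0, r_geometric, 10)

print(round(r_geometric, 3), round(r_arithmetic, 3))
print(round(overshoot, 2), round(recovered, 2))
```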
* Formulas (19) and (20) may be derived as follows: Let

$P_0$ = population of the state on Jan. 1, 1930,
$P_1$ = population of the state on Jan. 1, 1931; etc.
$r$ = constant annual rate of increase.

For the remainder of this footnote, see its continuation below.
7. Population Rates.—The ratio of divorces to population, say
3.2 per 1,000, is an illustration of the kind of rate that is important
for sociologists. Other examples are the crude birth rate (births
per 1,000 population per year) and the crime rate (say, convictions
per 1,000 males 10 years old and over per year). A rate
shows the amount of one variable per given amount of another
variable, e.g., the number of births in relation to a given number
of women of child-bearing age in a population.

In working with population rates, such as marriage rates,
death rates, etc., it is helpful to have in mind what is meant
by a rate. Mathematicians define a rate as the amount of
change in a function (dependent variable) that occurs per unit
change in the independent variable. The rate of travel of an
automobile is the number of miles by which its position in space
(function) changes per change of 1 hour in time (independent
variable). How does a marriage rate fit the mathematical
idea of a rate? The usual form of the marriage rate is the
number of marriages per 1,000 population per year. Here,
however, there are three variables instead of the two mentioned
in the mathematical definition of a rate. Which of these is the
function, which is the independent variable, and how is the third
variable to be interpreted? The time element is usually regarded
as the independent variable in rate problems, and the factor
that varies with time, as the function. In our example, both
the number of marriages and the size of the population base may
change from year to year. Either of these alone related to time
would give a mathematical rate. But we are not interested in
such a rate. Rather, we want to know how the ratio of marriages
to total population changes with time. It is, then, this ratio
that is the function in our marriage rate.

* (Footnote continued.) Then

$$P_1 = P_0 + P_0 r = P_0(1 + r),$$

$$P_2 = P_1 + P_1 r = P_0(1 + r) + P_0(1 + r)r = P_0(1 + r)(1 + r) = P_0(1 + r)^2.$$

Similarly,

$$P_3 = P_0(1 + r)^3.$$

If n = number of years between censuses,

$$P_n = P_0(1 + r)^n,$$

or

$$\begin{aligned}
(1 + r)^n &= \frac{P_n}{P_0}, \\
\log (1 + r)^n &= \log \frac{P_n}{P_0}, \\
n \log (1 + r) &= \log \frac{P_n}{P_0}, \\
\log (1 + r) &= \frac{1}{n} \log \frac{P_n}{P_0}, \\
\log (1 + r) &= \log \left(\frac{P_n}{P_0}\right)^{1/n},
\end{aligned}$$

or

$$1 + r = \left(\frac{P_n}{P_0}\right)^{1/n}, \qquad r = \left(\frac{P_n}{P_0}\right)^{1/n} - 1.$$
In the case of the marriage rate, we are primarily interested
in the annual changes in the number of marriages, and not in the
change in the population base. The only reason for introducing
the population base at all is to eliminate it as a cause of change
in the number of marriages, so that the annual change in the
number of marriages may be comparable from one population to
another.

This raises an important point. Is the population base the
only factor that needs to be eliminated or controlled in order
that the marriage rate may mean just what we want it to? In
order to have the marriage rate as comparable as possible from
one population to another, should we not also control the factors
of age and sex composition, so that their influences are eliminated
from the rate? That depends on the question we want answered.
If our question is which population has the higher marriage ratio,
regardless of the [...], we do not control age and sex. But if
we wish to know which of the populations would have the higher
marriage ratio if their age and sex distributions were the same,
we must control age and sex. This leads us to
specific, gross, and net marriage rates for females.

We note as a general principle that the denominator
of the final rate should ideally be composed of the group
exposed to the event (e.g., if the event is marriage, the group
exposed should be composed exclusively of, say, unmarried
females), while the numerator should contain the number of
events (e.g., marriages) occurring in the year. In the case of
most crude rates, like official birth and marriage rates, this
principle is disregarded.

When marriage or other rates are plotted, it is usually advisa- 
ble to plot them on semilogarithmic paper, in order that the rate of 
change may be shown by the steepness of the graph. Plotting the 
rates directly on semilogarithmic paper is equivalent to plotting 
the logarithms of the rates, which in turn is similar to plotting 
the percentage of change in the rates from year to year (see Chap. 
VI, Fig. 19). 

It may be of interest to compute two of the most important of 
the refined rates used in current vital statistics. The gross 
reproduction rate is defined as the average number of girls born 
per woman passing through child-bearing age, say 15 to 50 years, 
without mortality, and exposed to the birth rate of a given year. 
The net reproduction rate is simply the gross reproduction rate 
corrected for mortality. In Table 32 these rates have been found 
for Wisconsin, with the year 1934 as the base. The gross rate 
appears as the total of col. (5), and the net rate as the total of 
col. (7). It is seen that each 1,000 women born, if none died and 
all were subjected to the average age-specific rates of 1934, would 
bear 1,110 daughters. However, if these 1,000 women were 
exposed to the death rates found in an appropriate life table, 
they would bear only 995 daughters to start the next generation. 
Since the actual distribution of women by age groups was 
eliminated as a factor in Table 32 from col. (4) on, it is possible 
that there may be a disproportionate number of young females 
in the population of Wisconsin in 1934 and that this may prevent 
the population from actually declining for a time, even though 
the net reproduction rate is less than 1. But if the net reproduction
rate of 1934 should continue until the age distribution was
stabilized, the female population of the state would then begin to 
decrease at the rate of 5 per 1,000 per generation. As a matter 
of fact, the birth rate was unusually low in 1934 on account of 
the economic depression and has since risen somewhat. The 
average birth rate over a period of, say, 3 to 5 years furnishes a 
more stable base than the rate for a single year, and for some 
purposes should be preferred. 
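
The arithmetic behind the gross and net rates of Table 32 can be sketched in Python. The figures are those of cols. (2), (3), and (6) of the table; the age-specific rates are rounded to three decimals before use, as the table appears to do, and the variable names are illustrative. Individual entries of col. (7) may differ from the printed table in the last digit because of rounding.

```python
# Wisconsin, 1934: seven 5-year age groups from 15-19 to 45-49.
females = [139600, 131369, 118042, 108496, 103165, 101089, 88432]
female_births = [2147, 7599, 7256, 5090, 2987, 1147, 100]
survival = [0.92512, 0.91480, 0.90117, 0.88626, 0.87016, 0.85071, 0.82522]

# Col. (4): daughters born per female in one year, rounded as in the table.
rates = [round(b / f, 3) for b, f in zip(female_births, females)]

# Col. (5): each woman spends 5 years in each age group.
per_period = [5 * r for r in rates]
gross_rate = sum(per_period)

# Col. (7): col. (5) reduced by the chance of surviving to that age group.
net_rate = sum(d * s for d, s in zip(per_period, survival))

print(round(gross_rate, 3), round(net_rate, 3))
```

A net rate just under 1 corresponds to the 995 daughters per 1,000 women mentioned in the text.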
TABLE 32.—GROSS AND NET REPRODUCTION RATES IN WISCONSIN, 1934

Column headings:
(1)  Age groups
(2)* Females, age 15-49, July 1, 1934
(3)† Female live births, 1934
(4)‡ Average daughters born per female in 1934
(5)§ Average daughters born to female in 5-year period
(6)‖ Rate of survival from birth
(7)¶ Average daughters born to female surviving to age 50

(1)    | (2)*    | (3)†   | (4)‡  | (5)§  | (6)‖    | (7)¶
15-19  | 139,600 |  2,147 | 0.015 | 0.075 | 0.92512 | 0.070
20-24  | 131,369 |  7,599 | 0.058 | 0.290 | 0.91480 | 0.265
25-29  | 118,042 |  7,256 | 0.061 | 0.305 | 0.90117 | 0.275
30-34  | 108,496 |  5,090 | 0.047 | 0.235 | 0.88626 | 0.208
35-39  | 103,165 |  2,987 | 0.029 | 0.145 | 0.87016 | 0.126
40-44  | 101,089 |  1,147 | 0.011 | 0.055 | 0.85071 | 0.047
45-49  |  88,432 |    100 | 0.001 | 0.005 | 0.82522 | 0.004
Total  | 790,193 | 26,326 |       | 1.110 |         | 0.995

* Estimated from the 1930 census with the aid of a life table for Wisconsin.
† Found by applying the percentage of total births, female, in 1934 to total live births,
corrected for underregistration.
‡ Column (3) divided by col. (2).
§ Column (4) multiplied by 5, since a woman in any 5-year age group is assumed to bear
as many daughters in each of the 5 years as in 1934.
‖ Taken from Life Table for White Females in Wisconsin, 1929-1931, prepared by the
Metropolitan Life Insurance Company.
¶ Column (5) multiplied by col. (6).

Exercises

NOTE: A calculating machine will save time in solving the problems
in this text. At least the student should own an inexpensive slide rule.

1. a. Find the crude mode, where appropriate, of each of the following
four series, and of all four combined, using formula (1):

AGE OF CHILDREN IN FOUR THREE-GENERATION KINSHIP GROUPS

Age of child, | Number of children in kinship group   | Total
years (X)     | I (f1) | II (f2) | III (f3) | IV (f4) | children (f)
 2            |   10   |   15    |    30    |   18    |   73
 6            |   21   |   20    |    29    |   26    |   96
10            |   33   |   36    |    18    |   14    |  101
14            |   20   |   10    |     6    |   12    |   48
18            |   12   |    7    |     5    |   28    |   52
22            |    4   |    1    |     3    |   13    |   21
Total         |  100   |   89    |    91    |  111    |  391
b. By use of formula (2), find the crude mode of the total series of
ages.


2. a. What is the median of each of the six series below? 


NUMBER OF PERSONS PER BROKEN HOME


Set I Set IT Set III Set IV Set V Set VI 
3 3 3 3 3 3 
5 5 5 5 2 1 
4 4 4 4 4 2 
1 1 2 2 1 4 
6 6 6 1 5 5 
8 8 8 4 8 8 
2 2 2 6 11 4 
11 111 1 4 5 11 
12 


b. What is the median of series IV and VI combined (added by rows)? 

NOTE: These series contain too few cases for the medians to have much
meaning; they are useful only for practice in finding the median.

3. a. Calculate the median of the two frequency distributions below:

PERCENTAGE OF CHURCHES WITHOUT A FULL-TIME MINISTER IN THE RURAL
COUNTIES OF TWO REGIONS

Percentage        | Region I,     | Region II,    | Regions I and II,
of churches (X*)  | counties (f1) | counties (f2) | counties (f)
 2.5              |      22       |       4       |       26
 7.5              |      94       |      18       |      112
12.5              |     221       |      26       |      247
17.5              |      85       |      17       |      102
22.5              |      67       |      25       |       92
27.5              |      39       |      14       |       53
Total             |     528       |     104       |      632
* Mid-point.


b. What is the median of the two distributions combined? How 
does it compare with the mean of the medians of the two separate dis- 
tributions? What is the meaning of the mean of the medians? 

4. The rural counties in 15 states were scored on various points, such
as percentage of homes with telephone, per capita expenditure for
schools, and so on, and the median score for the counties in each state
was found, giving 15 medians. It was then wanted to know the
median score of all counties in the 15 states together. How would you
find it?

5. In the table below, what is the arithmetic mean of (a) the populations
of the counties? (b) the birth rates?


County   Population (X1)   Birth rate per 1,000 population (X2)

1 8,003 19.5 
2 21,054 24.5 
3 34,301 21.1 
4 15,006 9.8 
5 72,573 23.1 
6 15,330 16.4 
7 10,233 17.4 
8 16,848 12.6 
9 37,581 21.2 
10 34,165 16.7 
11 30,503 19.1 
12 16,781 21.6 
13 119,217 18.3 
14 52,745 14.0 
15 18,182 18.6 
16 46,583 16.9 
17 27,037 17.3 
18 42,565 22.1 
19 3,815 15.5 
20 59,928 16.9 
21 11,471 25.2 
22 38,469 19.9 
23 21,953 18.1 
24 13,913 13.5 
25 20,039 19.2 



6. Find the mean of the following table by the short method, and
check it by changing the assumed mean.

WEEKLY WAGES RECEIVED BY 500 WOMEN EMPLOYED IN A GARMENT
FACTORY

Weekly wages (X)    Women (f)
$2.50- 3.49              5
 3.50- 4.49             71
 4.50- 5.49            126
 5.50- 6.49            132
 6.50- 7.49             98
 7.50- 8.49             47
 8.50- 9.49             23
 9.50-10.49              9
Total                  500


7. What is the mean of the table below?

ANNUAL NET INCOMES OF 150 LOUISIANA COTTON FARMS, 1936

Income (X)*    Farms (f)†
$  500             62
   750             45
 1,000             23
 1,250              8
 1,500              6
 1,750              2
 2,000              2
 2,250              1
 2,500              1
Total             150
* Mid-point.
† Frequency.


8. Calculate the mean number of years on farm reported by Iowa
farmers in 1929. Use deviations from an assumed mean.

IOWA FARM OPERATORS CLASSIFIED ACCORDING TO NUMBER OF YEARS ON
FARM, 1930

Years on farm (X)       Farmers (f)
Under 1 year               25,625
1 year                     20,140
2 to 4 years               36,496
5 to 9 years               33,465
10 years and over*         92,142
Total                     207,868

(Abstract of the Fifteenth Census of the United States, 1930, p. 582)
* Take the mid-point of this interval at 15 years.

9. The arithmetic mean of the number of years on farm reported by
249,588 Alabama farmers in 1930 was 6.1. What is the mean number
of years reported by Iowa and Alabama farmers combined, using the
result of Exercise 8?

10. The counties of Oklahoma are to be grouped according to their
infant mortality rates in 1939 as published by the Oklahoma Bureau of
Vital Statistics, with the purpose of correlating these rates with the
per capita expenditures for public schools. Have you any criticisms
of this method?

11. A writer on the family recently made this statement: “The Census
[...] for 1930 showed the average size of the American family to be
3.81 persons. But averages tell us little.” Can you suggest any
important information that this average conceals?

12. Can you propose a refinement of the crude marriage rate analogous
to the gross reproduction rate described in the text? How would
it differ in meaning from the present crude rate?

13. Calculate the net reproduction rate for your state, and explain
its meaning. Is the population of the state increasing or decreasing
at present? If the answer to this question seems to contradict the net
reproduction rate found, can you reconcile the difference?

14. What do you consider to be the most meaningful base for a divorce
rate and why?

15. At what mean rate did the population of Nashville, Tenn., increase 
between 1880 and 1890? Between 1920 and 1930? Plot the observed 
populations first on rectangular coordinate paper, then on
semilogarithmic paper, and study the differences.


POPULATION OF NASHVILLE, TENN., 1870-1930 


Census Population 
1870 25,865 
1880 43,350 
1890 76,168 
1900 80,865 
1910 110,364 
1920 118,342 
1930 153,866 


16. Using the data of Exercise 15, compare the geometric and arith- 
metic mean populations of Nashville, Tenn., between 1870 and 1930, and 
plot them in the graphs prepared in Exercise 15. Explain the results.
CHAPTER VIII
MEASURES OF DEVIATION AND PARTITION 


1. Deviation from an Average.—lIt is seldom possible to give 
a good idea of a series of ungrouped values or of a frequency dis- 
tribution by means of a single value, or average, alone. It is 
generally wise to exhibit the whole distribution in tabular form, 
and often to show it graphically as well. Mention of the range of 
the values, i.e., the highest and lowest values in the series and the
difference between them, is desirable. It is also important to 
accompany the average with some measure of variation or dis- 
persion. The purpose of a measure of dispersion is to show the 
extent to which the individual items in a series vary from their 
average. If the average value of the items is known, and also 
the amount by which a certain proportion of the items deviate 
from that average, a rather satisfactory idea of the distribution 
may be conveyed. For example, note the ungrouped items 4, 1, 
6, 7, 3, 9, 2, 1, 3, 4, representing the number of years between 
marriage and divorce in the case of 10 divorced couples. Their 
mean is 4 years. Six out of the 10 cases do not differ from the 
mean by more than 2 years. If, therefore, we describe the 
distribution to the reader by saying that the mean time between 
marriage and divorce is 4 years, and that three-fifths of the cases 
do not deviate from the mean by more than 2 years, he should 
have a better notion of the distribution than if we merely told
him to imagine 10 couples whose mean time between marriage 
and divorce was 4 years. 

2. The Average Deviation.— The simplest of the measures of 
dispersion is obtained by finding the amount by which each item 
deviates from the average value, adding these without regard to 
sign, and dividing the sum by the number of items, to obtain 
the average amount of deviation. Such a measure of deviation 
or dispersion is appropriately called the average deviation, and 
is often represented by the symbol A.D. 

In the case of ungrouped data, like the above series, 4, 1, 6,
7, 3, 9, 2, 1, 3, 4, representing the number of years between
marriage and divorce for 10 divorced couples, the average deviation
from the mean value of 4 years is found as shown in Table 33.


TABLE 33.—COMPUTATIONS FOR THE MEAN DEVIATION, UNGROUPED DATA

     X − M_x* = x
     4 − 4 =  0
     1 − 4 = −3
     6 − 4 = +2
     7 − 4 = +3
     3 − 4 = −1
     9 − 4 = +5
     2 − 4 = −2
     1 − 4 = −3
     3 − 4 = −1
     4 − 4 =  0

     Σ|x|† = 20

* M_x indicates the mean of the X values.
† The bars | | indicate that signs are disregarded. Σ means to add the 10 items.


If we add the values of x with respect for the signs, the result is
zero. Disregarding signs, however, the total is 20, and
A.D. = 20/10 = 2.
That is, the 10 values differ on the average from their mean by 2
years.
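The arithmetic of Table 33 can be checked with a few lines of Python. This is only a sketch; the function name `average_deviation` is chosen here for illustration and does not come from the text.

```python
from statistics import mean

# Years between marriage and divorce for the 10 couples of Table 33.
years = [4, 1, 6, 7, 3, 9, 2, 1, 3, 4]

def average_deviation(values, center=None):
    """Mean of the absolute deviations from `center` (default: the mean)."""
    if center is None:
        center = mean(values)
    return sum(abs(v - center) for v in values) / len(values)

signed_sum = sum(v - mean(years) for v in years)   # signs kept: the deviations cancel
ad = average_deviation(years)                      # signs disregarded
print(signed_sum, ad)
```

The signed deviations cancel to zero, which is why the signs have to be disregarded before averaging.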
A formula for use with grouped data is

\[ \text{A.D.} = \frac{\sum f\,|X - Av|}{N} = \frac{\sum f\,|x|}{N} \qquad (21) \]

where f is the frequency in any class interval, X is the value or
mid-point corresponding to a given frequency, Av is the average
used (mean, median, or mode—usually the mean), x = X − Av,
and N is the number of items or the sum of the frequencies (f).
The calculation of the A.D. from the mean, M, is illustrated in
Table 34. In the table, the x's are obtained, of course, by
subtracting the value of the mean, 0.67, from each of the X
values.

There are short methods of finding the average deviation from
the mean or median, but they are rather cumbersome and will
not be described here.¹

1 See, for example, H. Sorenson, Statistics for Students of Psychology and
Education, p. 137, McGraw-Hill Book Company, Inc., New York, 1936.
TABLE 34.—NUMBER OF PREVIOUS ARRESTS RECORDED FOR 100 MURDERERS


The average deviation is usually smaller when taken from the 
median than when taken from the mean or the mode. 
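Formula (21) can be sketched in Python as follows. The X values and frequencies are those listed for this distribution in Table 37. Taking 0 as the median is an assumption made here for illustration (60 of the 100 cases have the value 0); an interpolated median would give a slightly different figure.

```python
# Number of previous arrests (X) and number of murderers (f), as listed in Table 37.
arrests = {0: 60, 1: 20, 2: 15, 3: 3, 4: 2}

def grouped_mean(freq):
    n = sum(freq.values())
    return sum(f * x for x, f in freq.items()) / n

def grouped_average_deviation(freq, center):
    """Formula (21): the sum of f * |X - Av| divided by N."""
    n = sum(freq.values())
    return sum(f * abs(x - center) for x, f in freq.items()) / n

m = grouped_mean(arrests)                           # 0.67
ad_mean = grouped_average_deviation(arrests, m)     # about 0.804
ad_median = grouped_average_deviation(arrests, 0)   # median taken as 0 (assumption)
print(round(m, 2), round(ad_mean, 3), round(ad_median, 3))
```

Under that assumption the deviation from the median comes out smaller than the deviation from the mean, in line with the remark above.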

3. The Standard Deviation.—Because the average deviation
disregards negative signs, another measure of dispersion, known
as the standard deviation, has been devised, which is free from
this objection. It is found by subtracting the mean of the X
values from each X value, or in grouped data from each mid-point
value, squaring these differences to make all signs positive,
multiplying them by their respective frequencies, summing them,
dividing by the sum of the frequencies, and extracting the
square root. The formulas for the standard deviation are, therefore,

\[ \sigma = \sqrt{\frac{\sum (X - M_x)^2}{N}} \qquad (22)^{*} \]

for ungrouped data, and

\[ \sigma = \sqrt{\frac{\sum f(X - M_x)^2}{N}} \qquad (23) \]

for grouped data. Letting x = X − M_x, these become

\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \qquad (24) \]

or

\[ \sigma = \sqrt{\frac{\sum f x^2}{N}} \qquad (25) \]

where σ is the standard deviation of the X values, X is the value
of an item or the value of the mid-point of a group of items, f is
the frequency of the items in a group or class interval (for
ungrouped data, f = 1), and N is the number of items, i.e.,
N = Σf.

* The Greek letter, small sigma, σ, is conventionally used to represent the
standard deviation.

To save labor in computing the standard deviation for a large
frequency table, a short method is commonly used:¹

\[ \sigma = i\sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} \qquad (26) \]

where d is the deviation of the mid-points from a guessed mean in
class interval units, and i is the width of the class interval. This
formula may also be modified for use with ungrouped data by taking the
assumed² mean at zero, so that d = X, f = 1, and i = 1:

\[ \sigma = \sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2} \qquad (27) \]

or

\[ \sigma = \sqrt{\frac{\sum X^2}{N} - M_x^2} \qquad (28) \]
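1 Derivation of formula (26):
By definition,

\[ \sigma = \sqrt{\frac{\sum f(X - M_x)^2}{N}} \qquad (23) \]

From Chap. VII, formulas (12) and (13),

\[ X = A + id \qquad (a) \]

\[ M_x = A + \frac{i\sum fd}{N} \qquad (b) \]

Substituting from (a) and (b) in (23),

\[ \sigma = \sqrt{\frac{\sum f\left(A + id - A - \dfrac{i\sum fd}{N}\right)^2}{N}} \]

\[ = i\sqrt{\frac{\sum fd^2}{N} - 2\left(\frac{\sum fd}{N}\right)^2 + \left(\frac{\sum fd}{N}\right)^2} \]

\[ = i\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \qquad (26) \]

2 See Chap. VII.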
In Chap. VII, we had the ungrouped data, 3, 7, 2, 12, 1, 16, 4,
representing the numbers of children in seven Italian immigrant 
families. The mean number of children per family was found 
to be 6.43. What is the standard deviation? If we use the 
long method of formula (22) or (24) above, we require the com- 
putations shown in Table 35. 

TABLE 35.—COMPUTATIONS FOR THE STANDARD DEVIATION, UNGROUPED DATA
(Long method)

    X       X − M_x                  (X − M_x)²
    3      3 − 6.43 = −3.43   |   (−3.43)² = 11.76
    7      7 − 6.43 = +0.57   |   ( 0.57)² =  0.32
    2      2 − 6.43 = −4.43   |   (−4.43)² = 19.62
   12     12 − 6.43 = +5.57   |   ( 5.57)² = 31.02
    1      1 − 6.43 = −5.43   |   (−5.43)² = 29.49
   16     16 − 6.43 = +9.57   |   ( 9.57)² = 91.58
    4      4 − 6.43 = −2.43   |   (−2.43)² =  5.90
   45                                        189.69


Substituting in formula (22),

\[ \sigma = \sqrt{\frac{189.69}{7}} = 5.21. \]


For the short method of formula (27) or (28), we need only the 
two totals, as shown in Table 36. 


TABLE 36.—COMPUTATIONS FOR THE STANDARD DEVIATION, UNGROUPED DATA
(Short method)

    X      X²
    3       9
    7      49
    2       4
   12     144
    1       1
   16     256
    4      16
   45     479

Substituting in formula (27),

\[ \sigma = \sqrt{\frac{479}{7} - \left(\frac{45}{7}\right)^2} = 5.21, \]

as before. The saving of labor in comparison with the first
method is evident. 
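The agreement of the long and short methods can be confirmed numerically. The sketch below uses only the seven values given above; the variable names are illustrative.

```python
from math import sqrt

children = [3, 7, 2, 12, 1, 16, 4]   # children per family, from Chap. VII
n = len(children)
m = sum(children) / n

# Long method, formula (22): square the deviations from the mean.
sigma_long = sqrt(sum((x - m) ** 2 for x in children) / n)

# Short method, formula (27): only the totals of X and X squared are needed.
sum_x = sum(children)                  # 45
sum_x2 = sum(x * x for x in children)  # 479
sigma_short = sqrt(sum_x2 / n - (sum_x / n) ** 2)

print(round(sigma_long, 2), round(sigma_short, 2))
```

Both routes give the same value because the short formula is an algebraic rearrangement of the long one, not an approximation.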

Let us next find the standard deviation of Table 34 above, 
and compare it with the average deviation previously obtained 
for the same table. We shall again first employ the long method, 
to clarify the meaning of the arithmetic, and to enable the 
student to compare the amount of work required relative to the 
short method to follow. The formula that describes the long 
method for grouped data is formula (23) or (25), which calls for 
the computations shown in Table 37. The mean of the table is 
0.67. 


TABLE 37.—COMPUTATION OF STANDARD DEVIATION FOR TABLE 34
(Long method)

   X      f     X − M_x = x    f(X − M_x) = fx    f(X − M_x)² = fx²
   0     60       −0.67           −40.20               26.93
   1     20       +0.33           + 6.60                2.18
   2     15       +1.33           +19.95               26.53
   3      3       +2.33           + 6.99               16.29
   4      2       +3.33           + 6.66               22.18
 Total  100                         0.00               94.11


Substituting in formula (23) or (25),

\[ \sigma = \sqrt{\frac{94.11}{100}} = 0.97. \]


Turning to the short method of formula (26), the steps are
worked out in Table 38.¹ Notice the so-called Charlier check
included in Table 38: Σf + 2Σfd + Σfd² = Σf(d + 1)², or
100 + 2(−33) + 105 = 139, which is the total of the last
column of the table. This checks all of the work of the table.

TABLE 38.—COMPUTATION OF STANDARD DEVIATION FOR TABLE 34
(Short method)

   X      f      d      fd     fd²    f(d + 1)²
   0     60     −1     −60     60        0
   1     20      0       0      0       20
   2     15     +1     +15     15       60
   3      3     +2     + 6     12       27
   4      2     +3     + 6     18       32
 Total  100            −33    105      139

Substitution in formula (26) now gives 


\[ \sigma = 1\sqrt{\frac{105}{100} - \left(\frac{-33}{100}\right)^2}, \]
σ = 0.97,


which is the value reached by the long method. 
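The short method and its Charlier check lend themselves to a brief sketch in Python. The guessed mean A = 1 is inferred from the d column of Table 38; the variable names are illustrative only.

```python
from math import sqrt

# (X, f) pairs from Table 38. Guessed mean A = 1, interval width i = 1.
table = [(0, 60), (1, 20), (2, 15), (3, 3), (4, 2)]
A, i = 1, 1

n = sum(f for _, f in table)
d_f = [((x - A) / i, f) for x, f in table]          # deviations in class-interval units

sum_fd = sum(f * d for d, f in d_f)                 # -33
sum_fd2 = sum(f * d * d for d, f in d_f)            # 105
charlier = sum(f * (d + 1) ** 2 for d, f in d_f)    # 139

# Charlier check: both sides agree when the columns have been added correctly.
check_ok = (n + 2 * sum_fd + sum_fd2 == charlier)

sigma = i * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)   # formula (26)
print(check_ok, round(sigma, 2))
```

The check only guards the additions in the table; it does not detect a wrong X value or frequency copied into every column.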

The average deviation of Table 34 was found in Sec. 2 above to
be 0.804, while we see that the standard deviation is 0.97. The
standard deviation is never smaller than the average deviation
taken from the same mean, and is ordinarily larger,
because squaring the differences gives greater weight to the
extreme values.

Because of the inaccuracies due to grouping data in class
intervals, the standard deviation squared, called the variance,
of a distribution that is fairly symmetrical² in form is often
corrected by subtracting from it the value i²/12 in the case of a
continuous variable, or (i² − 1)/12 in the case of a discrete
variable. In the above problem the variable is discrete, so
that we have (0.97)² − (1/12 − 1/12) = (0.97)², and σ remains
unchanged. There is no error of grouping when the variable is
discrete and i = 1. This correction is known as Sheppard's
correction. In its usual form it cannot be applied to very skewed
or asymmetrical distributions.

If we have calculated the standard deviation of each of two
series, and then wish to know the standard deviation of the two
series combined, the latter may be found from the formula

\[ \sigma = \sqrt{\frac{N_1(\sigma_1^2 + M_1^2) + N_2(\sigma_2^2 + M_2^2)}{N} - M^2} \qquad (29) \]

where the subscripts differentiate the two series, and no
subscript indicates the combined series. Where there are more than
two series, a term N_i(σ_i² + M_i²) is inserted in the formula for
each additional series.

1 In Table 38, it happens that the X values are already in unit step
deviation form—0, 1, 2, etc.—so that very little labor is saved by using the d
column. We might, therefore, have used X in place of d in formula (26).
The student is asked to do this as a check on the calculations in Table 38.

2 The distribution should be normal in form. See Chap. IX.

For example, for Table 34 we have N₁ = 100, σ₁² = 0.94,
and M₁² = 0.45. In a second sample of the same kind, we are given
N₂ = 80, σ₂² = 0.6302, and M₂² = 4.56. The mean of the two samples
combined is M = 1.32, so that M² = 1.74. From formula (29), for
the two samples combined, we find

\[ \sigma = \sqrt{\frac{100(0.94 + 0.45) + 80(0.6302 + 4.56)}{180} - 1.74}, \]

σ = 1.16.
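Formula (29) can be tested against a direct computation. In the sketch below the function `combined_sd` is a name invented here. The two short series are taken from earlier in the chapter and pooled purely as a numerical check, since pooling them has no substantive meaning. The means 0.67 and √4.56 in the last call are derived from the squared means quoted in the text.

```python
from math import sqrt
from statistics import fmean, pstdev

def combined_sd(groups):
    """Formula (29), extended to any number of series.

    `groups` is a list of (N, sigma, mean) triples, one per series.
    """
    n = sum(g[0] for g in groups)
    m = sum(g[0] * g[2] for g in groups) / n                  # combined mean M
    second_moment = sum(g[0] * (g[1] ** 2 + g[2] ** 2) for g in groups) / n
    return sqrt(second_moment - m ** 2)

a = [3, 7, 2, 12, 1, 16, 4]
b = [4, 1, 6, 7, 3, 9, 2, 1, 3, 4]
from_formula = combined_sd([(len(a), pstdev(a), fmean(a)),
                            (len(b), pstdev(b), fmean(b))])
direct = pstdev(a + b)            # standard deviation of the pooled values

# The example in the text, without rounding the intermediate values.
text_example = combined_sd([(100, sqrt(0.94), 0.67),
                            (80, sqrt(0.6302), sqrt(4.56))])
print(round(from_formula, 4), round(direct, 4), round(text_example, 2))
```

Carried through without rounding, the text's example comes to about 1.15; the printed 1.16 reflects the rounded intermediate values.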


Just as the average deviation is usually a minimum when
taken from the median, so the standard deviation is a minimum 
when taken from the mean. In fact, the standard deviation is 
practically never taken from any average except the mean, and 
formulas (27) and (28), above, are valid only for the mean. 

4. Effect of Coding* on Averages and Measures of Dispersion.—
If the frequencies in a frequency table are divided through by a
constant, k, the averages and measures of dispersion or partition
calculated from the table will not be changed.¹ Since it is
possible to simplify the computation in this way, it is desirable
to use this device whenever the opportunity offers.

1 The student is asked to test this for himself, using Table 39,
in calculating the mean and the standard deviation.


TABLE 39.—MEAN ANNUAL INCOME OF 500 CLERICAL WORKERS

   Mean Income     Families
       (X)            (f)
     $  500            25
      1,000           150
      1,500           200
      2,000            75
      2,500            50
      Total           500


It is also often convenient to reduce the absolute frequencies to
percentage frequencies before using them in computation.
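The claim that dividing the frequencies by a constant leaves the mean and standard deviation unchanged can be tried on Table 39. In this sketch the divisor k = 25 is chosen here because it divides every frequency evenly; the function name is illustrative.

```python
from math import sqrt

# Table 39: mean income (X) and frequency (f).
incomes = {500: 25, 1000: 150, 1500: 200, 2000: 75, 2500: 50}

def mean_and_sd(freq):
    n = sum(freq.values())
    m = sum(f * x for x, f in freq.items()) / n
    sd = sqrt(sum(f * (x - m) ** 2 for x, f in freq.items()) / n)
    return m, sd

k = 25                                                    # common divisor of the frequencies
coded = {x: f / k for x, f in incomes.items()}            # frequencies 1, 6, 8, 3, 2
percent = {x: 100 * f / 500 for x, f in incomes.items()}  # percentage frequencies

original = mean_and_sd(incomes)
print(original, mean_and_sd(coded), mean_and_sd(percent))
```

The constant cancels between the numerator and N in each formula, so any positive k gives the same result.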

* Dividing the frequencies of a distribution by a constant.

5. The Coefficient of Variation.—The average or standard
deviations of two frequency distributions are not directly
comparable, because they depend upon the size of the mean or
median in each case, and upon the particular unit used. For
example, the weights of a herd of elephants may vary on the
average by 280 lb., while the weights of a litter of mice may differ
by 0.1 oz. Yet the mice may show a greater variation than the 
elephants relative to their mean weights. Average and standard 
deviations may, therefore, be made comparable by expressing 
them as percentages of their means or medians. This percentage 
is called the coefficient of variation in terms of the average or 
standard deviation, and is written 


\[ V = \frac{100\,\text{A.D.}}{M} \qquad (30) \]

\[ V = \frac{100\,\text{A.D.}}{Md.} \qquad (31)^{*} \]

and

\[ V = \frac{100\sigma}{M} \qquad (32) \]


It is possible to use the coefficient of variation, V, as a measure 
of the representativeness of an average. It may be said, arbi- 
trarily, that when V is above 50 per cent, it is usually advisable 
to abandon the use of an average as a single value intended to give 
an idea of the central tendency of a series. The V calculated for 
the mean of Table 34 above by formula (30) is 


\[ V = \frac{100(0.804)}{0.67} = 120 \text{ per cent.} \]
In this case, V is 70 points above 50 per cent; hence the mean is 
obviously a poor device for representing the actual values in this 
very skewed or J-shaped distribution. If we apply formula (30)
to the mean of Table 40, below, which is merely a rearrangement
of the frequencies of Table 34 in more symmetrical form for
purposes of illustration, we find that V = 100(0.467)/1.97 = 24
per cent, indicating that the mean represents the values in this
table very well. This result would be expected from an inspection
of the distribution, which appears to be fairly symmetrical in form,
with the largest frequency in the center.

TABLE 40.—PREVIOUS ARRESTS RECORDED FOR 100 MURDERERS
(Frequencies of Table 34 rearranged)

* Only one of these formulas should be used in the same comparison.

In using formulas (30), (31), and (32), it will be seen that if
two distributions have equal average or standard deviations, but 
unequal means or medians, the one with the larger average will 
have the smaller coefficient of variation, V. This is as it should 
be, provided that the means or medians used in finding the V’s 
contain no element that spuriously raises or lowers the values 
from which the averages are calculated. 

Suppose that the question is asked, Does Table 34 or Table 40
show a greater amount of variability from the mean? In the
case of Table 34 it has been seen that V = 120 per cent, and for
Table 40 it was found that V = 24 per cent. The V's, therefore, show
that Table 34 is 120/24 = 5 times as variable as Table 40, whereas
the average deviations would indicate that the former distribution
was less than twice as variable as the latter.
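The comparison can be reproduced with a short Python sketch. The frequencies for Table 34 are those listed in Table 37. The arrangement used for Table 40 is an assumption made here, since the body of that table is not reproduced; it was chosen because it matches the stated mean of 1.97 and average deviation of 0.467.

```python
def coefficient_of_variation(deviation, average):
    """Formulas (30) to (32): a deviation expressed as a percentage of an average."""
    return 100 * deviation / average

def mean_and_ad(freq):
    n = sum(freq.values())
    m = sum(f * x for x, f in freq.items()) / n
    ad = sum(f * abs(x - m) for x, f in freq.items()) / n
    return m, ad

table_34 = {0: 60, 1: 20, 2: 15, 3: 3, 4: 2}
# Assumed arrangement for Table 40 (hypothetical; see the note above).
table_40 = {0: 2, 1: 20, 2: 60, 3: 15, 4: 3}

m34, ad34 = mean_and_ad(table_34)
m40, ad40 = mean_and_ad(table_40)
v34 = coefficient_of_variation(ad34, m34)   # 120 per cent
v40 = coefficient_of_variation(ad40, m40)   # about 24 per cent
print(round(v34), round(v40), round(ad34 / ad40, 2))
```

The ratio of the two V's is about 5, while the ratio of the two average deviations is about 1.7, which is the contrast drawn in the text.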

6. Partition Values.—To show the scale values below which
any desired proportion of the frequencies in a distribution fall, a
set of partition values known as quartiles, deciles, etc., or more
inclusively as percentiles, has been devised. These measures
all employ the principle of the median, and apply primarily to
grouped data. Thus, while the median is that scale value below
which half of the values fall, the first quartile, Q₁, is the scale
value below which lie one-fourth of the values; the third quartile,
Q₃, is the scale value below which lie three-fourths of the values;
the ninth decile, d₉, is the scale value below which lie 90 per cent
of the values; the 65th percentile is the scale value below which
lie 65 per cent of the values; and so on. It is, therefore, seen
that each of these measures is merely a particular percentile
value, the median corresponding to the 50th percentile, the first
quartile to the 25th percentile, the third quartile to the 75th
percentile. The general method of finding any value is the same.¹

For grouped data, it will be recalled that the median is located
by dividing the total frequency, N, by 2, counting up the column
of accumulated frequencies of the table until the lower limit of
the class interval is reached which contains the median value,
and then interpolating within this interval to determine the
median value. When finding any percentile value other than
the median, we need only change the coefficient of the total
frequency, N. For example, in the case of Q₁, we use N/4, for
Q₃, 3N/4, for d₉, 0.9N, for the 65th percentile, 0.65N, and so on.
The general formula, using P to represent any percentile, median,
decile, or quartile value on the X scale, is

\[ P = L + \left(\frac{pN - F}{f}\right)i \qquad (33) \]

where p is the percentile rank or point of division on the frequency
scale expressed as a decimal fraction (e.g., p = 0.75), L is the lower
limit of the interval containing the pth value, N is the total
frequency of the table, F is the sum of the frequencies falling
below (i.e., in class intervals with limits smaller than) L, f is
the number of frequencies in the interval containing the pth
value, and i is the size of interval containing the pth value.

1 Because of logical difficulties, it is seldom that any partition
value except the median is found for ungrouped data. When
the attempt must be made, however, it is generally best to accept
rough approximations, rather than insist upon exact but imaginary
computations. For example, if we are required to furnish the third quartile, Q₃,
for the array of 12 ages—3, 5, 6, 9, 11, 16, 20, 21, 24, 25, 26, and 30 years—
we may find the position 12 × 0.75 = 9, and say that 100(9/12) = 75 per cent
of the ages are less than the age of 25 years that occupies 9 + 1 = 10th place
in the array. This statement is correct in the present case; but it is not
correct to say, further, that 100 − 75 = 25 per cent of the ages are greater
than 25 years. If the age 24 years in the array were replaced by a second
age 25 years, then the age 25 years would no longer be greater than 75 per
cent of the ages, but it would still probably be the most appropriate age to
offer as an approximate value for Q₃.

When the position, Np, found by multiplying the total number of items,
N, by the given percentage value, p, is not a whole number, the matter is
more complicated. Thus, if we drop the age 30 years from the top of the
above array, we have Np = 11 × 0.75 = 8.25. There is no 8.25th position
in this array, so we have to choose between positions number 8 and 9, or
else interpolate between them. If we take position 8 as the nearest integer,
and add one to it, as we did above, we get position 9. The age corresponding
to this position is 24 years, and we see that eight ages, or 100(8/11) = 72.7
per cent of the ages, are less than this age. Since 72.7 per cent is rather
close to 75 per cent, the age 24 years seems to be the simplest approximate
value to assign to Q₃.

Only when no actual position in an array gives a reasonably close
approximation to the meaning of a required percentile is it usually worth while to
interpolate between two positions. If our array above consisted of only the
first 10 ages, to find Q₃ we would have pN = 0.75(10) = 7.5. The age in
the eighth position is greater than 100(7/10) = 70 per cent of all the ages,
whereas that in the ninth position is greater than 100(8/10) = 80 per cent
of the ages. Here we might prefer to take the interpolated position,
Np = 7.5, so that, assuming continuous or grouped data, the theoretical
age corresponding to it would be greater than 100(7.5/10) = 75 per cent
of the ages in the array. This theoretical age, or value of Q₃, must be
halfway between age 20 in seventh position and age 21 in eighth position, or
(20 + 21)/2 = 20.5 years.

Notice that, in ungrouped data, the empirical formula, Np + 1, used
for locating the approximate integral position of such a partition value as
Q₃, is replaced by the formula p(N + 1) for determining the median position.

Let us find the values of Q₁, Q₃, d₇, and P₃₃ (33rd percentile)
in Table 41.

TABLE 41.—DISTRIBUTION OF THE ESTIMATED INCOME AMONG UNMARRIED
WOMEN OF THE UNITED STATES IN 1910*

     Income         Women        (Y)         Percentage
      (X)            (Y)      Accumulated    accumulated
  $  100-  199        10           10            0.55
     200-  299        70           80            4.42
     300-  399       560          640           35.36
     400-  499       530        1,170           64.64
     500-  599       280        1,450           80.11
     600-  699       150        1,600           88.40
     700-  799       110        1,710           94.48
     800-  899        37        1,747           96.52
     900-  999        22        1,769           97.73
   1,000-1,099        16        1,785           98.62
   1,100-1,199        12        1,797           99.28
   1,200-1,299         8        1,805           99.72
   1,300-1,399         5        1,810          100.00
     Total         1,810

* From W. I. King, Wealth and Income of the People of the United States, p. 224, 1915.
To find Q₁, we have

pN = 0.25(1,810) = 452.5.

Counting up (i.e., in the direction of increasing values on the X
scale) the accumulated frequency column of the table, we see that
452.5 lies in the class interval 300-399. Therefore,

L = 300.
F = 80.
f = 560.
i = 100.

Substituting in equation (33),

\[ Q_1 = 300 + \frac{452.5 - 80}{560}\,100 = 366.5. \]

That is, one-fourth of the women earned less than $366.50 a year.
Similarly,

\[ Q_3 = 500 + \frac{1{,}357.5 - 1{,}170}{280}\,100, \]
or
Q₃ = 567,

\[ d_7 = 500 + \frac{1{,}267 - 1{,}170}{280}\,100, \]
or
d₇ = 534.6,

\[ P_{33} = 300 + \frac{597.3 - 80}{560}\,100, \]
or
P₃₃ = 392.4.


From these results we notice that three-fourths of the working
women made less than $567 annually, 70 per cent of them made
below $534.60, and one-third made under $392.40. Of course, 
there is no point in calculating all of these values except for 
illustrative purposes. We are usually interested in such fractions 
as one-third, one-half, or three-fourths. 
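The interpolation of formula (33) is mechanical enough to sketch in Python. The class limits and frequencies are those of Table 41; the function name `percentile_value` is chosen here for illustration.

```python
# Table 41: (lower limit L of each $100 class, frequency f).
classes = [(100, 10), (200, 70), (300, 560), (400, 530), (500, 280), (600, 150),
           (700, 110), (800, 37), (900, 22), (1000, 16), (1100, 12), (1200, 8),
           (1300, 5)]
WIDTH = 100

def percentile_value(p, classes, width):
    """Formula (33): P = L + ((pN - F) / f) * i, with p given as a decimal fraction."""
    n = sum(f for _, f in classes)
    target = p * n
    below = 0                      # F: frequencies in the intervals below the current one
    for lower, f in classes:
        if below + f >= target:
            return lower + (target - below) / f * width
        below += f
    raise ValueError("p must lie between 0 and 1")

q1 = percentile_value(0.25, classes, WIDTH)
q3 = percentile_value(0.75, classes, WIDTH)
d7 = percentile_value(0.70, classes, WIDTH)
p33 = percentile_value(0.33, classes, WIDTH)
print(round(q1, 1), round(q3), round(d7, 1), round(p33, 1))
```

The loop plays the part of counting up the accumulated frequency column until the interval containing the pN-th case is reached.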

An investigator often requires, not the value below which a 
certain percentage of the frequencies fall, but the reverse of this, 
namely, the percentage of the cases that falls below a certain 
value, that is, the percentile rank of the value. Referring back 
to the ungrouped array of 11 ages used above, viz., 3, 5, 6, 9, 11, 
16, 20, 21, 24, 25, and 26 years, we may require the percentile rank 
of the person aged 21. Since, by definition, this is equivalent to 
asking what percentage of the persons in the array are less than 
21 years of age, we note that there are 7 persons out of 11 who
are younger than 21 years, and compute 7/11 = 0.636, or 63.6 per
cent. We then say that the percentile rank of the person aged
21 is approximately 64.
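For an ungrouped array the count can be written directly. A minimal sketch, with an illustrative function name:

```python
ages = [3, 5, 6, 9, 11, 16, 20, 21, 24, 25, 26]

def percentile_rank(value, values):
    """Percentage of the items that are smaller than `value`."""
    smaller = sum(1 for v in values if v < value)
    return 100 * smaller / len(values)

rank_21 = percentile_rank(21, ages)   # 7 of the 11 ages are below 21
print(round(rank_21, 1))
```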

Turning to grouped data, suppose we ask what proportion of 
the unmarried women of Table 41 earned less than some minimum 
living wage, say $550 a year. Our problem now is, knowing a 
value on the X scale, to find the percentage of values on the Y 
scale that fall below it. In the present case, it is evident that 
1,170 women earned less than $500, and that 280 earned between
$500 and $599. We have ((550 − 500)/(600 − 500))(280) = 140, as the
number of women earning between $500 and $550. Therefore
1,170 + 140 = 1,310 is the number of women who made less
than $550. Expressed as a percentage of the total number of
women workers, we find that 100(1,310/1,810) = 72 per cent of the
women failed to earn as much as the minimum amount. A
formula for this calculation is

\[ p = \left[F + \frac{f(P - L)}{i}\right]\frac{100}{N} \qquad (34) \]

where p is the percentile rank sought, P is the given X scale value,
F is the accumulated frequency in the class intervals with
limits smaller than those of the interval including P, f is the
frequency of the interval including P, L is the lower limit of this
same interval, i is its width, and N is the total frequency of the
table. Thus, substituting the values of the preceding problem
in formula (34), we get

\[ p = \left[1{,}170 + \frac{280(550 - 500)}{100}\right]\frac{100}{1{,}810}, \]

or p = 72 per cent, as before.

An X scale value corresponding to a given accumulated
frequency, or a percentage frequency corresponding to a given
scale value, may easily be found by means of a cumulative
curve, which was described in Fig. 11, Chap. VI. The student is
asked to use this device to check the arithmetical results just
obtained from Table 41 above, preferably plotting the curve from
the last column of that table.
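Formula (34) is the reverse of formula (33) and can be sketched in the same way. The class limits and frequencies are again those of Table 41; the function name is illustrative.

```python
# Table 41: (lower limit L of each $100 class, frequency f).
classes = [(100, 10), (200, 70), (300, 560), (400, 530), (500, 280), (600, 150),
           (700, 110), (800, 37), (900, 22), (1000, 16), (1100, 12), (1200, 8),
           (1300, 5)]

def percentile_rank_grouped(value, classes, width):
    """Formula (34): p = [F + f(P - L) / i] * 100 / N."""
    n = sum(f for _, f in classes)
    below = 0                      # F: frequencies in the intervals below the current one
    for lower, f in classes:
        if lower <= value < lower + width:
            return (below + f * (value - lower) / width) * 100 / n
        below += f
    raise ValueError("value lies outside the table")

rank_550 = percentile_rank_grouped(550, classes, 100)
print(round(rank_550, 1))
```

Like the hand calculation, the sketch assumes that the cases are spread evenly within each class interval.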
A measure known as the quartile deviation is sometimes used.
The formula is

\[ Q = \frac{Q_3 - Q_1}{2} \qquad (35) \]

Thus, for Table 41,

\[ Q = \frac{567 - 366.5}{2} = 100.25. \]


The quartile deviation is employed only when the median is the 
preferred average. 

All these measures of dispersion—quartiles, deciles, percentiles,
quartile deviation—are so-called position values, and have the
same advantages and disadvantages as the median, previously
discussed. In particular, they are insensitive to extreme
values, and cannot be treated algebraically. They are especially
useful in analyzing a skewed frequency distribution, since they
maintain a definite relationship to the distribution, regardless
of its shape.

FIG. 37.—The distance Q₁-Q₃ includes one-half of the cases. (Points marked on the scale: Q₁, Md., Q₃.)

7. Comparable Measures or Scores.—When two frequency
distributions are of about the same shape, e.g., both about
symmetrical, both slightly skewed in the same direction, both
J-shaped, etc., distances on their scales are usually compared
in units of their respective standard deviations. Thus, if we
have the distributions of many scores on two independent tests
of a given trait, for each test the deviations of the scores from
the true¹ mean are divided by the true standard deviation,
to get the desired standard scores. Given, for Test I, true
mean = 70, true σ = 10; and for Test II, true mean = 62,
true σ = 12. If subject A scored 80 on Test I and 60 on Test II,
his standard score on Test I is (80 − 70)/10 = 1, and on Test II is
(60 − 62)/12 = −0.17; and his combined score on the two tests is
1 + (−0.17) = 0.83. If subject B scored 75 on Test I and 65
on Test II, his corresponding standard scores are (75 − 70)/10 = 0.50
on Test I, and (65 − 62)/12 = 0.25 on Test II; and his combined
score is 0.75.

1 By true is meant a statistic derived from many applications of a test,
rather than from a single application, to the same universe or type of subjects.
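The standard scores of the two subjects can be recomputed in a few lines. The dictionary layout below is only one convenient arrangement, not anything prescribed by the text.

```python
def standard_score(score, mean, sd):
    """Deviation from the mean in units of the standard deviation."""
    return (score - mean) / sd

tests = {"I": (70, 10), "II": (62, 12)}                      # (true mean, true sigma)
scores = {"A": {"I": 80, "II": 60}, "B": {"I": 75, "II": 65}}

combined = {
    subject: sum(standard_score(s, *tests[t]) for t, s in by_test.items())
    for subject, by_test in scores.items()
}
print({k: round(v, 2) for k, v in combined.items()})
```

Subject A's raw total (140) equals subject B's, yet the standard scores separate them, because the same number of raw points counts for less on the test with the larger standard deviation.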

Where two distributions differ markedly in form, e.g., one being
about symmetrical and the other J-shaped, or one very peaked
and the other flat, the standard deviations do not provide
consistent units for reducing their scale distances to more comparable
terms, because the proportion of frequencies included between
the mean and one standard deviation on each side of it changes
with the form of the distribution. Theoretically, perhaps the
best procedure under these circumstances is to normalize both
distributions, but the method is too complex to introduce here.¹

A cruder but much simpler method uses the Q's instead of the σ's
as common denominators. Although Q also has disadvantages,
it is one-half of the range Q₁-Q₃, within which always falls
the middle half of the frequencies; and in that sense its
interpretation is independent of the shape of the distribution (see Fig. 37).
Suppose now that the distributions of many scores in Tests I
and II above are quite different, being J-shaped to the left
in Test I and skewed to the right in Test II. For Test I the
true median score is 74, and the true Q value is 6; for Test II,
the median score is 59 and Q is 8. We divide the deviations
of the two subjects' scores from the medians by the respective
Q values, and get (80 − 74)/6 + (60 − 59)/8 = 1.125 as the combined
score of subject A, and (75 − 74)/6 + (65 − 59)/8 = 0.917 as the
combined score of subject B. These may be called the Q scores.

Instead of the standard scores or Q scores described above, the
method of equivalent percentile scores may be used in the effort to
make two independent scales comparable. For each scale, every
percentile or, say, every fifth percentile is found, and these values
are arranged in two parallel series, where corresponding pairs of
values are regarded as equivalent. Thus, in Table 42 below, the
values X₁ = 13 and X₂ = 0.1 are equivalent on the two scales.
The percentile values are found arithmetically from the two
given frequency distributions of scale values by formula (33)
above, or graphically from the ogive curve as illustrated in Fig.
11, Chap. VI. Suppose that we wish to compare the score of
subject 114 on Test X₁, 85, with the score of subject 17 on Test
X₂, 2.6. From Table 42 we see that a score of 2.6 on scale X₂
is equivalent to a score of 93 on scale X₁. Hence the two
comparable scores are 85 and 93. If either or both of the scores
of subjects 114 and 17 did not appear in Table 42, we would
find the percentile rank of say the second of them by formula
(34) above, and then, using this in formula (33), find the
corresponding value on the X₁ scale. This equivalent X₁ value
would then be compared with the X₁ score of the other subject.

1 See "Obtaining Comparable ... from Distributions of Dissimilar Shape,"
Journal of the American Statistical Association, Vol. 26, pp. 455-460, 1931.
TABLE 42.—TWO SERIES OF EQUIVALENT PERCENTILE SCALE VALUES:
ATTITUDE TOWARD WAR

   Percentile     Scale, X₁     Scale, X₂
      (n)           (Pₙ)*         (Pₙ)
        5            13            0.1
       10            23            0.4
       15            32            0.5
       20            41            0.6
       25            49            0.65
       30            56            0.8
       35            63            1.0
       40            69            1.2
       45            75            1.4
       50            80            1.6
       55            85            1.9
       60            89            2.2
       65            93            2.6
       70            95            2.9
       75            97            3.2
       80            97.5          3.6
       85            98            3.9
       90            98.5          4.2
       95            99            4.6
      100           100            5.0

* nth percentile scale value.

The above three methods are not applicable to ungrouped or
scanty data.
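The lookup of equivalent scores can be sketched as follows, using only the rows of Table 42 from the 50th to the 75th percentile. Straight-line interpolation between tabled rows is an assumption made here for scores that fall between rows; the text itself proceeds through formulas (34) and (33) in that case.

```python
from bisect import bisect_left

# Rows of Table 42, 50th to 75th percentile: (percentile, X1 value, X2 value).
rows = [(50, 80, 1.6), (55, 85, 1.9), (60, 89, 2.2),
        (65, 93, 2.6), (70, 95, 2.9), (75, 97, 3.2)]

def x2_to_x1(score):
    """Equivalent X1 value for an X2 score, interpolating linearly between rows."""
    x2 = [r[2] for r in rows]
    x1 = [r[1] for r in rows]
    if not x2[0] <= score <= x2[-1]:
        raise ValueError("score lies outside the tabled range")
    k = bisect_left(x2, score)
    if x2[k] == score:
        return float(x1[k])
    frac = (score - x2[k - 1]) / (x2[k] - x2[k - 1])
    return x1[k - 1] + frac * (x1[k] - x1[k - 1])

equivalent = x2_to_x1(2.6)     # subject 17's score of 2.6 expressed on the X1 scale
print(equivalent)
```

The result, 93, is then compared with subject 114's score of 85 on the same scale.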

When the data are inadequate, or when for other reasons we 
have more confidence in the ability of two scales to arrange items 
in rank order than to measure distances between them, simple
percentile ranks may be used for purposes of comparison. Given 
the scores on a test, the percentile rank is found for each score. 
For example, if 62 per cent of the scores made on a test are 
less than the score 80, the percentile rank of the latter is 62. 
For ungrouped data, the percentile ranks are found by the
informal method outlined in Sec. 6 above; for grouped data, the
percentile ranks are obtained arithmetically from formula (34), 
above, or graphically from an ogive. The weakness of percentile 
ranks is, of course, that they do not reflect the distances between 
the scores on any scale. Thus, the score 70 may have a per- 
centile rank of 50, the score 77 a percentile rank of 60, and the 
score 85 a percentile rank of 90, so that the successive scores 
stand in the ratio of 1:1.1, whereas the corresponding successive 
percentile ranks bear the ratios 1:1.2 and 1:1.5, respectively. 
For this reason, the difference between percentile ranks should 
not be interpreted as proportional to the distance between the 
corresponding scale values. 

As a matter of fact, there is usually no feasible method of 
treating scores obtained from the use of very different kinds of 
scales that makes them strictly comparable. 


Exercises 


1. Compare the average deviation and the standard deviation of the 
series below. Find the standard deviation by formulas (24) and (28) 
as a check. 


NUMBER OF DEPENDENTS IN 25 FAMILIES ON RELIEF


Family no. Dependents Family no. Dependents
1 3 14 5 
2 5 15 3 
3 4 16 3 
4 1 17 2 
5 6 18 4 
6 8 19 1 
7 2 20 3 
8 3 21 4 
9 3 22 3 
10 2 23 6 
11 4 24 2 
12 и 25 3 


m 
©з 
t2 
2. Compare the average deviation and the standard deviation of the
following frequency distribution, using for the standard deviation 
formula (26) with the Charlier check: 


SEMESTER HOURS OF MATHEMATICS TAKEN BY 67 STUDENTS IN A CLASS OF
ELEMENTARY SOCIAL STATISTICS 
Semester Hours Students 
43.5—46.4 1 
40.5-43.4 
37.5-40.4 
34.5-37.4 
31.5-34.4 
28.5-31.4 
25.5-28.4 
22.5-25.4 
4 
4 
4 
4 


e noe 
Slee SSnonannoooo 


3. Use the coefficient of variation, V, to measure the representative- 
ness of the mean of the distribution in Exercise 2, above. 

4. Below are two random samples of family incomes in a certain city, 
one taken in 1928, the other in 1932. Did the depression reduce or 
increase the spread in income between families? 


                       Number of families
     Income             1928       1932
   Under $500              5         76
     500-  999            15        123
   1,000-1,499           115        155
   1,500-1,999           190         91
   2,000-2,499            82         70
   2,500-2,999            63         52
   3,000-3,499            27         17
   3,500-3,999            19         12
   4,000-4,499            10          7
   4,500-4,999             6          3
   5,000-5,499             3          1
     Total               535        607

5. Using the standard deviations found for the 1928 and 1932 series
in Exercise 4, compute the standard deviation for the two series 
combined. 

6. The table below shows the number of children who required the
specified numbers of hours of social contact before they were “accepted”
in a certain play group. (a) What percentage of the children took less
than 4 hours? (b) What percentage of the children took more than
10 hours? (c) How many hours did three-fourths of the children
require less than? (d) How many hours did three-fourths of the children
require more than?


Hours Children 


о 
SluwanoSanar 


7. Given two independent scales, X and Y, for the measurement of
“cooperation” between members of a random sample of urban families.
Family A has a score of +1.2 on scale X, family B has a score of 86 on
scale Y. Reduce these scores to as nearly comparable terms as you can.

        Scale X                  Scale Y
     Score        f           Score      f
   −2.5-−2.9      4            0- 9      16
   −2.0-−2.4     12           10-19      68
   −1.5-−1.9     22           20-29     109
   −1.0-−1.4     45           30-39     140
   −0.5-−0.9     71           40-49     131
    0.0-−0.4     89           50-59      91
    0.0-+0.4    116           60-69      74
   +0.5-+0.9    132           70-79      56
   +1.0-+1.4    151           80-89      28
   +1.5-+1.9     93           90-99      18
   +2.0-+2.4     60           Total     731
   +2.5-+2.9     17

8. Suppose that the frequencies for the Y scale in Exercise 7 are
reversed end for end of the scale, while those for the X scale remain as
they are. Convert these scores to a more comparable basis.
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CHAPTER IX 


COMBINATION, PROBABILITY, AND THE NORMAL 
DISTRIBUTION 


1. Permutations and Combinations.¹—It is often desirable in
sociological investigations to know the total number of ways in
which a certain event can occur. For example, in a study of
intercity migration among five cities, how many paths can the
migration take? Or, among 10 girls in a boarding school, three
two-girl friendships are found. How many such friendships are
possible in this group? The same kind of problem arises in
connection with the binomial formula, discussed in Sec. 3 below.

To answer the question about the paths of migration, we
notice that since a migrant may go from any of the five cities
to any of the four remaining cities, the number of paths must be
5 × 4 = 20. Not only do we count each pair of cities, but also
the two orders or arrangements in which the members of a pair
may be taken, as “from a to b,” and “from b to a.” A pair of
cities in a given order, e.g., “from a to b,” is called a permutation,
and the general formula provided by algebra for finding the
number of permutations of n things taken r at a time is

$${}_nP_r = \frac{n!}{(n - r)!}. \tag{36}$$

For the problem above, we substitute in the formula, and get

$${}_5P_2 = \frac{5!}{(5 - 2)!} = \frac{5 \times 4 \times 3!}{3!} = 5 \times 4 = 20,$$

as before.
Formula (36) is based on Theorem 1. 
1 For a fuller treatment of this subject, see any text in college algebra,
e.g., H. B. Fine, College Algebra, Chap. XXV, Ginn and Company, Boston.

2 n! is called “n factorial,” and means the product of all consecutive
numbers from 1 through n. For example, 4! = 4 × 3 × 2 × 1 = 24.

THEOREM 1. If an event A can occur in m ways, and thereafter
an event B can occur in n ways, A and B can occur together in the
order named in mn ways.

A first approach to the problem of the boarding school friendships
mentioned above can also be made by means of formula (36). The
number of arrangements, or permutations, of 10 girls taken two at a
time is

$${}_{10}P_2 = \frac{10!}{8!} = 10 \times 9 = 90.$$

Here, however, there is no interest in the order of the girls in a
two-girl friendship. When this is the case, i.e., when a group
of things is taken without regard for the arrangement of the
members, the group is called a combination. Evidently, each
pair of girls can be arranged in two orders or permutations, so
that the 90 permutations found above reduce to 90/2 = 45 combinations.
More generally, each group of r things can be arranged in r! orders,
so that the number of permutations must be divided by r!. The
formula for combinations is, therefore,

$${}_nC_r = \frac{n!}{r!\,(n - r)!}. \tag{37}$$

Using it, we get again

$${}_{10}C_2 = \frac{10!}{2!\,8!} = \frac{10 \times 9}{2 \times 1} = \frac{90}{2} = 45.$$


Although formulas (36) and (37) apply to a large number of
problems, some problems occur that are best approached independently.
As an easy example, suppose we ask, What is the total
number of possible relationships that can exist between two
persons, X and Y, in terms of attraction, indifference, and
repulsion? To each of the three attitudes of X, Y may respond
with three attitudes, so that, by Theorem 1 above, we have
3 × 3 = 9 relationships. These relationships are (1) mutual
attraction between X and Y; (2) X is attracted by Y, but Y is
indifferent to X; (3) X is attracted by Y, but Y is repulsed by
X; (4) mutual indifference between X and Y; (5) X is indifferent
to Y, but Y is attracted by X; (6) X is indifferent to Y, but Y is
repulsed by X; (7) mutual repulsion between X and Y; (8) X is
repulsed by Y, but Y is indifferent to X; and (9) X is repulsed by
Y, but Y is attracted by X.
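
The three counts worked out in this section can be checked by direct enumeration. The following is a minimal sketch in Python, not part of the original text; the city labels and the numbering of the girls are hypothetical, chosen only for illustration.

```python
from itertools import combinations, permutations, product
from math import comb, perm

# Five cities (labels are hypothetical); a migration path is an ordered pair.
cities = ["a", "b", "c", "d", "e"]
migration_paths = list(permutations(cities, 2))

# Ten girls (numbered for illustration); a friendship is an unordered pair.
girls = list(range(1, 11))
friendships = list(combinations(girls, 2))

# Theorem 1: X may take any of three attitudes, and so may Y.
attitudes = ["attraction", "indifference", "repulsion"]
relationships = list(product(attitudes, repeat=2))

print(len(migration_paths), perm(5, 2))  # formula (36): 20 by either route
print(len(friendships), comb(10, 2))     # formula (37): 45 by either route
print(len(relationships))                # 3 x 3 = 9
```

Listing the cases is practical only for small groups; the formulas give the same counts without enumeration.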
2. Probability.—Chance, often called “luck,” and the tricks
it plays are known to everyone. In a hand at cards, one may 
draw no ace; one, two, three, or even all four aces. Whether a 
person is male or female, white or black, European or American, 
is, as far as he is concerned, purely an accident. The occupation 
one follows, the person one marries, the state of one’s health, 
and so on, are also subject to a great amount of chance. Dis- 
covery and invention, even the trend in the development of a 
nation’s culture in the sociological sense, depend in part on 
thousands of small forces of which we have no knowledge. 
If the birth rates in a city differ in 1939 and 1940, is it because 
fundamental conditions affecting fertility have changed, or is 
the variation due merely to accidental factors that will cancel 
out over several years? In one random sample of old people 
there may be more male than female survivors and in another 
sample exactly the reverse, regardless of the true proportion
in the population. It is, therefore, not surprising that any 
careful attempt to investigate social life or culture is obliged 
to reckon with this element of chance. Chance distorts the 
findings of research, and must be allowed for. 

One of the greatest practical contributions of mathematics
has been its discovery, beneath apparent confusion, of a remarkable
regularity in the occurrence of chance events. By mathematical
means, we can estimate the amount of variation due to
chance and predict the number of occurrences of any event whose
probability is known, e.g., the annual deaths in a class of insurance
risks. On these mathematical laws of probability are
founded great business enterprises like insurance, as well as the
basic techniques of a vast amount of scientific and industrial
research.

The exact mathematical definition of probability is this: If an
event can succeed in m ways and fail in m' ways, all equally
likely and mutually exclusive, and the event must either succeed
or fail, the probability of its succeeding is

$$p = \frac{m}{m + m'}, \tag{38}$$

and that of its failing is

$$q = \frac{m'}{m + m'}. \tag{39}$$

That is,

$$p + q = \frac{m + m'}{m + m'} = 1. \tag{40}$$


In other words, since an event must either succeed or fail, the 
probability of certainty is one in one, or unity. 


turn up as the other. This is an illustration of the theoretical or 
a priori method. By the so-called empirical method, the chance
of death within a year of a white male, aged 30, engaged in a 
clerical occupation, married, and an “A” medical risk, is found 
by simply counting the proportion of annual deaths occurring 
among a very large number of such individuals (say, 354 deaths 
among 85,707 persons, giving a probability of 0.00413). The 


above 


to approach nearer to 
some figure 0.00400, then 0.00400 may be regarded as an approxi- 


mation of the true (expected) proportion that exists in the given 
class as a whole (an infinite universe). But it would obviously 
be wrong to apply this death rate to a class in which the age was 
40 instead of 30 years! 

Two basic theorems of probability are 

THEOREM 2. Of two mutually exclusive* events, A and B,
if the event A has a probability of occurring, p, and the event B
has a probability of occurring, p', the probability that either A or
B will occur in one possible way is p + p'.


* Two events are mutually exclusive when in a single trial only one of
them can happen. In a hand at cards, drawing an ace and drawing a jack
are mutually exclusive events, but drawing an ace and drawing a diamond
are not, because both may appear on the same card. If the two events are
not mutually exclusive but are independent, the probability is p + p' − pp'.


THEOREM 3. If an event A has a probability of occurring, p, and
an event B has a probability of occurring with or after A in one
possible way, p', the probability that both A and B will so occur is pp'.

An application of Theorem 3 may be made to a typical
problem. A community is inhabited by two groups of different
national and religious backgrounds, Swedish Lutherans and
German Catholics. Among the Lutherans in the age class 40 to
45 years are 40 females and 44 males, among the Catholics 62
females and 58 males, all married to someone included in the
enumeration. The records show 18 mixed marriages, 11 between
Lutheran males and Catholic females, and 7 between Catholic
males and Lutheran females. How does this observation compare
with the number of mixed marriages that would be expected
if there were no prejudice for or against them in the community?

TABLE 43.—FOURFOLD TABLE FOR DETERMINING PROBABILITY OF MIXED
MARRIAGES

                                     Males
Females          Lutheran (1)          Catholic (2)          Total (3)
Catholic (1)     62(44)/102 = 26.7     62(58)/102 = 35.3     62 = n1
Lutheran (2)     40(44)/102 = 17.3     40(58)/102 = 22.7     40 = n2
Total (3)        44 = m1               58 = m2               102 = N

Consider the totals of Table 43. By the definition on page 145,
the probability of a marriage occurring in row (1) is n1/N = 62/102,
and of a marriage occurring in col. (1) is m1/N = 44/102. By
Theorem 3, the probability of a marriage occurring in both row
(1) and column (1) is (n1/N)(m1/N) = (62/102)(44/102); therefore the
expected frequency in this cell is N(n1/N)(m1/N) = 62(44)/102 = 26.7.
Similarly, the expected frequency in the cell of row (2)
and col. (2) is 40(58)/102 = 22.7. Thus the total expected number of
mixed marriages is 26.7 + 22.7 = 49.4, or approximately 49;
whereas, the observed number is 18, only 36 per cent
of the expected number. Evidently, there are obstacles in the
way of marriages between the Swedish Lutherans and the German
Catholics in this community.
This conclusion may be more fully established by applying the
chi-square ($\chi^2$) method to Table 44. This method is designed
to test the hypothesis that the differences between a set of
observed and expected frequencies may be due solely to chance.

To obtain $\chi^2$, we subtract each expected frequency ($f_t$) from
the corresponding observed frequency ($f_o$), divide the squared
difference by the expected frequency, and sum these ratios; that is,
$\chi^2 = \sum (f_o - f_t)^2 / f_t$.
The calculations are shown in Table 44.

TABLE 44.—CHI-SQUARE ($\chi^2$) TEST

        Marriages
Females     Males       Observed   Theoretical   f_o − f_t   (f_o − f_t)²   (f_o − f_t)²/f_t
                          (f_o)      (f_t)*
Catholic    Lutheran       11         26.7         −15.7        246.5            9.23
Catholic    Catholic       51         35.3         +15.7        246.5            6.98
Lutheran    Lutheran       33         17.3         +15.7        246.5           14.25
Lutheran    Catholic        7         22.7         −15.7        246.5           10.86
                                                     0.0                        41.32 = $\chi^2$

* If any theoretical cell frequency is less than five, a correction is needed.
See Paul Rider, An Introduction to Modern Statistical Methods, pp. 112-113,
John Wiley & Sons, Inc., New York, 1939.


It was seen above that the expected frequencies used in Table
44 were calculated from the row and column totals of the observed
frequencies in Table 43. This means that the observed and
expected frequencies in the cells of Table 44 were to a certain
extent made to agree. Evidently, this forced agreement should
be allowed for in testing the amount of difference between the
two sets of frequencies. In any 2 × 2 table, like Table 43,
it is clear that if the row totals, the column totals, and one
observed cell frequency are given, the other three cell frequencies
are at once determined. Therefore, only one cell frequency is
free of the influence of the marginal totals, so that a 2 × 2 table
is said to have one degree of freedom.¹ If now the value of $\chi^2$
obtained is referred to a table of $\chi^2$, such as Appendix Table 2,
that takes account of degrees of freedom, the spurious resemblance
between the observed and expected frequencies to which
we objected above is corrected for.

¹ The degrees of freedom for any contingency table are (c − 1)(r − 1),
where c is the number of columns and r is the number of rows. See
Treloar, Elements of Statistical Reasoning, New York, 1939.

Entering Appendix Table 2 with one degree of freedom, then,
we find that a $\chi^2$ as large as 6.635 could occur by chance once in
100 times, the theory involved here being similar to that described
in the latter part of Sec. 4, below. Our $\chi^2$ is 41.32, which is
much larger, and would occur by chance less often than once in
100 times. Since it is customary to reject chance as the explanation
of an event that can happen by chance no oftener than five
times in 100, we conclude that the frequency of mixed marriages
in the community cited is reduced by sociological and perhaps
economic forces.
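
The arithmetic of Tables 43 and 44 can be retraced with a short Python sketch. It is an illustration only, not the author's procedure: the variable names are invented here, and the 1 per cent value 6.635 is simply copied from the text. Because the sketch keeps the expected frequencies unrounded, it gives a $\chi^2$ of about 41.6, slightly above the 41.32 obtained in Table 44 from frequencies rounded to one decimal.

```python
# Observed marriages from Table 44, keyed by (female group, male group).
observed = {
    ("Catholic", "Lutheran"): 11,
    ("Catholic", "Catholic"): 51,
    ("Lutheran", "Lutheran"): 33,
    ("Lutheran", "Catholic"): 7,
}

groups = ["Catholic", "Lutheran"]
row_totals = {f: sum(observed[(f, m)] for m in groups) for f in groups}
col_totals = {m: sum(observed[(f, m)] for f in groups) for m in groups}
n_total = sum(observed.values())

# Expected cell frequency under chance alone: row total x column total / N.
expected = {
    (f, m): row_totals[f] * col_totals[m] / n_total
    for f in groups
    for m in groups
}

# Chi-square: sum of squared differences, each divided by its expected frequency.
chi_square = sum(
    (observed[cell] - expected[cell]) ** 2 / expected[cell] for cell in observed
)

# (rows - 1)(columns - 1) degrees of freedom; one for a 2 x 2 table.
degrees_of_freedom = (len(groups) - 1) * (len(groups) - 1)

CRITICAL_1_PERCENT = 6.635  # value quoted in the text for one degree of freedom

for cell in observed:
    print(cell, observed[cell], round(expected[cell], 1))
print("chi-square:", round(chi_square, 2))
print("exceeds the 1 per cent value:", chi_square > CRITICAL_1_PERCENT)
```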

The classic method of introducing the elementary notions of
probability is to use the illustration of coin tossing. The event
is the occurrence of a “head” or a “tail.” We may toss one
coin several times, several coins once, or several coins several
times, as we wish. It is evident that the events are mutually
exclusive, as specified in Theorem 2, above. We may also
assume that all the coins tossed during the experiment are
exactly alike in size, weight, shape, and balance, i.e., in respect
to every fixed or biased factor that affects the tendency of heads
or tails to fall uppermost when the coin is tossed. In this
way we meet the requirement that each event of a probability
set shall be equally likely. Differently expressed, it is assumed
that the probability, p, of throwing a head is the same for every
penny at each throw, and that every penny at each throw is
independent of every other penny, i.e., there is no tendency for
one penny to show heads or tails because another does or does
not, as would happen if they were stuck together. Finally,
of the two events that can occur, one, heads, we call a success,
and the other, tails, we call a failure. Having specified these
conditions, our first question is, What is the probability of
getting a head, or of getting a success, at any one toss of a
penny? In other words, what is the value of p?


Since in a single toss of one penny there is only one way in
which a success can occur and one way in which a failure can
occur, and we assume that the pennies are balanced so that these
two events are equally likely, we have, in the notation introduced
above, m = m' = 1. Hence, from formulas (38) and (39),
p = q, and substituting p for q or q for p in formula (40), we
find that p = q = 1/2 = 0.5.

A contingency table is a table of frequencies divided according to
two or more principles of classification, such as the table in
Exercise 7 at the end of this chapter.

“Probability set” is described by the denominator of formula (38)
or (39).

Suppose that we throw 10 pennies, and want to know the
probability of getting exactly eight of a kind, i.e., eight heads or
eight tails. If eight of 10 pennies show heads, then the other
two must show tails, or vice versa. We just saw that if we throw
one penny, the probability of getting a head in one throw is
p = .5. By Theorem 3, above, the probability of eight successes
occurring in one possible way is $p^8 = (.5)^8$, the probability of two
failures occurring in one possible way is $q^2 = (.5)^2$, and the
probability of these two events occurring together in one possible
way is $p^8q^2 = (.5)^8(.5)^2$. But the eight heads may occur among
the 10 pennies in several possible ways, so that by Theorem 2 the
probability of occurrence in just one way should be summed as
many times as there are possible ways, or, more briefly, multiplied
by the number of possible ways. How many possible ways are
there? This is equivalent to asking, In how many ways may we
get eight heads from 10 pennies, or, how many possible combinations
are there of 10 (= n) things taken eight (= r) at a time?
To answer this, we already have formula (37) above, which for
our problem gives¹

$${}_{10}C_8 = \frac{10!}{8!\,2!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)(2 \cdot 1)} = 45.$$

Hence the probability, P, of getting exactly eight heads in a single
throw of 10 pennies is

$$P = {}_nC_r\,p^r q^{n-r}, \tag{41}$$

$$P = 45(.5)^8(.5)^2 = 45(.5)^{10}.$$

Using logarithms,² we find

$$\log (.5)^{10} = 10 \log .5 = 10(9.69897 - 10) = 96.98970 - 100.$$


1 See Appendix Table 3. For extensive tables of factorials or their logarithms,
see T. C. Fry, Probability and Its Engineering Uses, pp. 427-438, D.
Van Nostrand Company, Inc., New York, 1928. A briefer table is given in
Mathematical Tables from Handbook of Chemistry and Physics, 5th ed., p.
180, Chemical Rubber Publishing Company, Cleveland.

2 See Appendix Table 7 and accompanying Foreword.
The antilogarithm of this is .0009766. Hence
P = 45(.0009766) = .044.

That is, in 44 out of 1,000 trials we would expect by the laws 
of chance to get exactly eight heads in a toss of 10 pennies. 
Similarly, by Theorem 2, the probability of getting either eight 
heads or eight tails is 2 × 0.044 = 0.088. This last is the probability
that answers our question. Any similar question can be
readily answered by substituting in formula (41), above. 
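
Formula (41) can also be evaluated directly, without logarithms. The sketch below is a minimal Python illustration; the function name `binomial_term` is an invention of this example, not a term used in the text.

```python
from math import comb


def binomial_term(n: int, r: int, p: float) -> float:
    """Formula (41): probability of exactly r successes among n events."""
    q = 1 - p
    return comb(n, r) * p**r * q ** (n - r)


# Exactly eight heads in one throw of 10 balanced pennies.
p_eight_heads = binomial_term(10, 8, 0.5)

# Eight heads or eight tails: two mutually exclusive events (Theorem 2).
p_eight_of_a_kind = 2 * p_eight_heads

print(round(p_eight_heads, 3))      # 0.044
print(round(p_eight_of_a_kind, 3))  # 0.088
```

Any other question of the same kind is answered by changing the three arguments, e.g., `binomial_term(10, 5, 0.5)` for exactly five heads.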

3. The Binomial Distribution.—We often want to know the
probability of getting as many as or more than a specified number
of successes or failures. From what has been said, it will be
seen that the probability of getting no successes at all in a toss
of n pennies is $q^n$, of getting one success is ${}_nC_1\,pq^{n-1}$, of getting
two successes is ${}_nC_2\,p^2q^{n-2}$, and so on, and, finally, the probability
of getting all successes is $p^n$. Since these combinations of events
exhaust the possibilities, some one of them is certain to occur
at any toss of n pennies. In other words, the probability of one
or another of them occurring is unity, or one. By Theorem 2,
we may therefore write the equation

$$q^n + {}_nC_1\,pq^{n-1} + {}_nC_2\,p^2q^{n-2} + {}_nC_3\,p^3q^{n-3} + \cdots + p^n = 1. \tag{42}$$

But by formula (40), $p + q = 1$, and hence $(q + p)^n = 1$. It
therefore appears that by substitution

$$(q + p)^n = q^n + {}_nC_1\,pq^{n-1} + {}_nC_2\,p^2q^{n-2} + {}_nC_3\,p^3q^{n-3} + \cdots + p^n. \tag{43}$$

If $p = q = \frac{1}{2}$, the formula simplifies to

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^n = \left(\tfrac{1}{2}\right)^n \left(1 + {}_nC_1 + {}_nC_2 + \cdots + 1\right). \tag{44}$$

This is the familiar binomial expansion of algebra, which is now
seen to be an expression of the operation of the laws of chance.¹


1 In algebra, the binomial formula is usually written:

$$(q + p)^n = q^n + nq^{n-1}p + \frac{n(n-1)}{1 \cdot 2}\,q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}\,q^{n-3}p^3 + \cdots + p^n;$$

and it is pointed out that the exponent of q decreases by 1, while the
exponent of p increases by 1, each term; and that the coefficient of any term, if
multiplied by the exponent of q and divided by the number of the term, gives
the coefficient of the next term.
To discover the probability of getting, say, eight or more heads
in a single toss of 10 pennies, therefore, we need only apply the
binomial. The probability of getting eight or more heads means,
specifically, the probability of getting eight, nine, or 10 heads;
and by Theorem 2, this is equal to the sum of the probabilities of
the three separate events. By formula (41), which is the general
term of the binomial, the probability of eight heads is ${}_{10}C_8\,p^8q^2$,
of nine heads is ${}_{10}C_9\,p^9q$, and of 10 heads is $p^{10}$. Summing these,

$$P = {}_{10}C_8\,p^8q^2 + {}_{10}C_9\,p^9q + p^{10} = 45(.5)^8(.5)^2 + 10(.5)^9(.5) + (.5)^{10}$$
$$= (.5)^{10}(45 + 10 + 1) = .0440 + .0098 + .0010 = 0.055.$$


Accordingly, in 1,000 throws of 10 pennies, we may expect to get 
eight, nine, or 10 heads about 55 times. And the probability of 
getting eight or more heads or eight or more tails is, of course, 
2 × .055 = .11, or 11 times in 100 throws. Notice that this is
merely the most probable number and will vary from one set of 
100 throws of 10 pennies each to another. But in a very large 
number of throws the average proportion should come rather 
close to 11 per 100 throws of 10 pennies each. 
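
The sum of the last three terms of the binomial can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (balanced, independent pennies); the function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """Sum of the binomial terms for k, k + 1, ..., n successes (Theorem 2)."""
    q = 1 - p
    return sum(comb(n, r) * p**r * q ** (n - r) for r in range(k, n + 1))


# Eight, nine, or 10 heads in one throw of 10 pennies: 56/1024.
p_eight_or_more_heads = prob_at_least(10, 8, 0.5)

# Eight or more heads, or eight or more tails.
p_either_extreme = 2 * p_eight_or_more_heads

print(round(p_eight_or_more_heads, 3))  # 0.055
print(round(p_either_extreme, 2))       # 0.11
```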

Suppose, again, we throw 10 pennies 150 times. In how 
many of these trials may we expect to get exactly eight of a kind? 
Since we have found this probability to be 0.088, we may expect 
this event in the proportion of about nine times in 100 trials, 
in the long run. If N represents the number of trials, and S
the number of trials in which the specified event may be expected 
to happen, the formula is approximately 


S = PN. (45) 


Substituting P = .088 and N = 150 in this formula, we find 
S = .088(150) = 13.2. That is, in 150 tosses of 10 pennies each,
about 13 is the most probable number of tosses that will show 
exactly eight heads or eight tails. 

Similarly, if it is wanted to know the frequency with which each
possible number of successes, from 0 to n, may be expected to
occur by chance in N trials of n events each, each term of the
binomial expansion in formula (42) or (43) is simply multiplied
by N:

$$N(q + p)^n = Nq^n + N\,{}_nC_1\,pq^{n-1} + N\,{}_nC_2\,p^2q^{n-2} + \cdots + Np^n. \tag{46}$$
Thus, if we throw 10 pennies 1,000 times, we have

$$1{,}000(.5 + .5)^{10} = .00098(1{,}000) + .00977(1{,}000) + .04395(1{,}000) + .11719(1{,}000)$$
$$+ .20508(1{,}000) + .24609(1{,}000) + .20508(1{,}000) + .11719(1{,}000)$$
$$+ .04395(1{,}000) + .00977(1{,}000) + .00098(1{,}000),$$

or

.98 + 9.77 + 43.95 + 117.19 + 205.08 + 246.09 + 205.08
+ 117.19 + 43.95 + 9.77 + .98 = 1,000 (apart from rounding).


This is really a binomial frequency distribution¹ and is so
arranged in Table 45. From this table, we see that out of 1,000
tosses of 10 pennies each, we would expect no heads in only
about one toss, one head in something like 10 tosses, two heads
in approximately 44 tosses, and so on.


TABLE 45.—FREQUENCY DISTRIBUTION OF 1,000 TOSSES OF 10 PENNIES

Number of        Number of
Heads (X)        Tosses (f)
     0               .98
     1              9.77
     2             43.95
     3            117.19
     4            205.08
     5            246.09
     6            205.08
     7            117.19
     8             43.95
     9              9.77
    10               .98
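
The frequencies of Table 45 follow from multiplying each term of the binomial by the number of trials. A minimal Python sketch of that step is given below; the function name `expected_frequencies` is an assumption of this example.

```python
from math import comb


def expected_frequencies(n_trials: int, n: int, p: float) -> list[float]:
    """Each term of (q + p)^n multiplied by the number of trials N."""
    q = 1 - p
    return [n_trials * comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]


# 1,000 tosses of 10 balanced pennies, as in Table 45.
table_45 = expected_frequencies(1000, 10, 0.5)

for heads, tosses in enumerate(table_45):
    print(f"{heads:>2}  {tosses:7.2f}")

# Unrounded, the eleven frequencies add to the number of trials.
print(sum(table_45))
```

The printed column agrees with Table 45 to two decimals; the small excess over 1,000 in the printed table arises only from rounding each term.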


Let us now pass from the theoretical case of penny tossing to
some problem that might arise in social research. For example,
the proportion of males in the urban population of Wisconsin
in 1930 was $p_1 = 0.4974$; in the rural nonfarm population,
$p_2 = 0.5118$; and in the rural farm population, $p_3 = 0.5435$.
If we regard the three populations—urban, rural nonfarm, and
rural farm—as ranked in the order of urbanness, and if we subtract
the proportion of males in the less urban from that in the
more urban of each of the three possible pairings of these populations,
we get

1 See Chap. V. 
$p_1 - p_2 = .4974 - .5118 = -.0144.$
$p_1 - p_3 = .4974 - .5435 = -.0461.$
$p_2 - p_3 = .5118 - .5435 = -.0317.$

We notice that all three of the signs are negative. In the case of
another Middle Western state taken at random, the same result
was found. Does this mean that the proportion of males is really
greater in the more rural populations, or may the negative signs
in the two states be just a trick of chance? By formula (36), we
see that there are ${}_3P_3 = \frac{3!}{0!} = 6$ possible orders of relative
magnitude that $p_1$, $p_2$, and $p_3$ can take (e.g., $p_1 < p_2 < p_3$; $p_2 < p_1 < p_3$;
etc.), if we assume that they are never equal (i.e., $p_1 \neq p_2 \neq p_3$).
Since the order observed, $p_1 < p_2 < p_3$, is only one of the six, the
probability that it will occur in one random trial (or state) is $\frac{1}{6}$.
Hence the probability of getting only negative signs in both states
is

$${}_2C_2\left(\tfrac{1}{6}\right)^2\left(\tfrac{5}{6}\right)^0 = \left(\tfrac{1}{6}\right)^2 = \tfrac{1}{36} = .028$$

by formula (41).

Statisticians usually insist that the probability of a result arising
by chance be no greater than 5 in 100, or 0.05, before they will
risk the assumption that the result is not due to chance. By this
standard, since .028 is less than .05, we eliminate chance in the present
case, and are entitled to conclude that the proportion of males in
the three populations is related to the degree of urbanness in those
populations.
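
The reasoning of this example can be retraced by listing the possible orders. The Python sketch below is illustrative only; like the text, it assumes that the six orders are equally likely under chance and that the two states are independent.

```python
from fractions import Fraction
from itertools import permutations

# All possible orders of relative magnitude of the three proportions.
orders = list(permutations(["p1", "p2", "p3"]))

# Under chance alone each order is assumed equally likely.
p_one_state = Fraction(1, len(orders))

# The same order in two independent states (Theorem 3).
p_both_states = p_one_state**2

print(len(orders))                     # 6
print(p_both_states)                   # 1/36
print(round(float(p_both_states), 3))  # 0.028

# Compare with the customary standard of 5 in 100.
print(p_both_states <= Fraction(5, 100))  # True
```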

In many situations similar to this, the binomial theorem 
enables us to determine the probability that repeated events may 
occur by chance alone, and to note whether or not the probability 
is so small that we may reject the hypothesis that chance is 
responsible. 

It is important to ask what is meant by chance in the preceding
illustration. If we regard the census figures for the three populations
as representing three complete universes, there is no
question of chance at all. Any differences noted in the propor- 
tions of males, however small they may be, are real differences 
between the universes, and that is the end of the matter. But 
if we think of the proportion of males in each of our three popula- 
tions as determined by a separate set of causes acting to pro- 
duce sample results, and if we want to know whether or not these 
three sets of forces differ in any real way from one another, the
problem of chance at once enters. By chance we mean a great
number of small, unknown factors acting in many directions, as
contrasted with large (biased)¹ factors, usually known or knowable,
acting constantly in the same direction. If the biased factors
affecting the proportion of males differ from one of the three 
populations to another—e.g., more females than males migrate 
from rural to urban areas—the observed proportions of males will 
differ to a greater extent than can be accounted for by the action 
of small random forces. If the biased factors that produce the 
proportion of males in each of the three populations are essentially 
the same, however, any variation in the proportion of males 
from one population to another must be due to chance factors 
alone. It is usually good research method to seek to eliminate 
chance as a possible cause of differences before undertaking to 
discover what factors are responsible. 

If we already know from independent evidence, however, that 
important factors influencing the proportion of males varied 
between the three populations—e.g., the two sexes migrated 
unequally from the more-rural to the less-rural areas—there 
would be no point in testing the hypothesis that the differences 
were due to chance, except perhaps to confirm the a priori 
knowledge. When such a test fails to eliminate chance, it often 
means only that a larger sample is needed. It may sometimes be 
advisable to investigate carefully the biased factors in the situa- 
tions under comparison, even when chance has not been elimi- 
nated as a possible cause of the differences observed between them. 

A binomial distribution, such as that of Table 45, is like other
distributions in having a mean, a standard deviation, and other
statistical constants by which it may be described. The formulas
for the mean and the standard deviation are²

$$M_B = np, \tag{47}$$

$$\sigma_B = \sqrt{npq}, \tag{48}$$

where the symbols have the same meanings as above.

For the distribution of Table 45, $M_B = 10(.5) = 5$ heads, and
$\sigma_B = \sqrt{10(.5)(.5)} = 1.58$ heads.

1 See p. 149, above.

2 For the derivation of these formulas see, for example, C. H. Richardson,
An Introduction to Statistical Analysis, pp. 228-230, Harcourt, Brace and
Company, Inc., New York, 1934. The subscript, B, means binomial.
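
Formulas (47) and (48) may be compared with the mean and standard deviation computed term by term from the distribution itself. The following Python sketch makes that comparison for the case of Table 45; the variable names are chosen for this example only.

```python
from math import comb, sqrt

n, p = 10, 0.5
q = 1 - p

# Formulas (47) and (48).
mean_formula = n * p
sd_formula = sqrt(n * p * q)

# The same constants computed directly from the terms of the binomial.
probs = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]
mean_direct = sum(x * pr for x, pr in enumerate(probs))
sd_direct = sqrt(sum((x - mean_direct) ** 2 * pr for x, pr in enumerate(probs)))

print(mean_formula, round(sd_formula, 2))  # 5.0 1.58
print(mean_direct, round(sd_direct, 2))    # 5.0 1.58
```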
It is not necessary in chance situations that p and q should be
equal. Thus, the probability of throwing an ace in a single toss
of a die is $p = \frac{1}{6}$, and the probability of not throwing an ace is
$q = \frac{5}{6}$. If 15 throws are to be made, the binomial is $(\frac{5}{6} + \frac{1}{6})^{15}$,
and this can be expanded and utilized just as was done above
for $p = q = \frac{1}{2}$. When $p = q$, the binomial is symmetrical in
shape; when $p \gtrless q$,¹ it is asymmetrical or skewed.
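
The skewness that arises when p and q are unequal can be seen by expanding the binomial for the die. The Python sketch below does so for 15 throws; it assumes a balanced die and independent throws, as the text does for pennies.

```python
from math import comb

n = 15     # throws of the die
p = 1 / 6  # probability of an ace
q = 1 - p

# Terms of the binomial: probability of exactly x aces in 15 throws.
terms = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]

# Most probable number of aces, and the mean number by formula (47).
most_probable = max(range(n + 1), key=lambda x: terms[x])
mean_aces = n * p

print(most_probable, mean_aces)
# The two tails are unequal, so the distribution is skewed.
print(terms[0] > terms[n], terms[1] > terms[n - 1])
```

The terms pile up near the low end of the scale and trail off toward 15 aces, in contrast with the symmetrical distribution of Table 45.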

1 ≷ means greater or less than.

4. The Normal Distribution.—Graphs of the binomials
$32(\frac{1}{2} + \frac{1}{2})^5$ and $1{,}024(\frac{1}{2} + \frac{1}{2})^{10}$ are shown in Fig. 38. Notice that
they take the form of histograms rather than of smooth curves,
because successes are counted only in whole numbers, yielding a
discrete or discontinuous series. However, if the length of the
scale is kept constant, as in the figure, the graph of the binomial
$1{,}024(\frac{1}{2} + \frac{1}{2})^{10}$ is seen to be less broken in outline than is that of the
binomial $32(\frac{1}{2} + \frac{1}{2})^5$. As n increases, the graph approaches closer
and closer to a smooth curve in appearance.

FIG. 38.—Histograms of two binomials, $N(\frac{1}{2} + \frac{1}{2})^n$, as n increases.

FIG. 39.—The normal curve.

If now n is indefinitely increased, giving the binomial
$N(\frac{1}{2} + \frac{1}{2})^n$, it is evident that
the intervals of the graph become smaller and smaller, until
in effect the outline merges into that of a smooth curve. The
resulting curve is the most important type of distribution in
statistical theory, and is known variously as the normal curve, the
Gaussian curve, the curve of error, or the curve of probabilities.
Unlike the binomial distribution, it represents a continuous
variable, which can take any value whatever, on the X scale.
A graph of the normal curve is shown in Fig. 39. It may be
thought of as enclosing a continuous surface, cut from a piece of
thin sheet metal. Its equation is usually written


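$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}, \tag{49}$$

where x = X − M, or a mean deviate of X,
σ = standard deviation of the distribution,
N = total frequency of the distribution,
π = 3.1416, so that $\sqrt{2\pi}$ = 2.5066,
e = 2.7183, the base of natural logarithms.

If the area of the curve is taken as unity, equation (49) becomes

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}. \tag{50}$$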


As an aid to understanding the curve represented by equation
(50), let us analyze its equation. We shall begin by letting
$t = x/\sigma$, so that equation (50) becomes

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}. \tag{51}$$

In the calculation of tables of normal ordinates, it is also convenient
to let $\sigma = 1$, giving

$$y = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}. \tag{52}$$

But, as seen above, π is a mathematical constant with the value
3.1416, so that $\sqrt{2\pi} = 2.5066$, and $\frac{1}{2.5066} = .3989$. Equation
(52) may therefore be written

$$y = .3989\,e^{-\frac{t^2}{2}}. \tag{53}$$
In Fig. 39, at the mean of the X's on the X axis, x = 0. The
height of the ordinate at any point is the value of y found from
equation (53) by substituting the appropriate value of t. At
x = 0, $t = x/\sigma = 0/\sigma = 0$, and

$$y = .3989\,e^{-\frac{(0)^2}{2}},$$
$$y = .3989\,e^{0}.$$

But any number raised to the zero power is 1,* so that

$$y = .3989(1) = .3989.$$

In other words, at the mean of the X's, the height of the ordinate,
y, is .3989, for any normal curve of unit area and unit standard
deviation. This is plotted in Fig. 39.

Next, for the same case, let $t = x/\sigma = +2$. Then, by
formula (53),

$$y = .3989\,e^{-\frac{(2)^2}{2}},$$
$$y = .3989\,e^{-\frac{4}{2}},$$
$$y = .3989\,e^{-2}.$$

But

$$e^{-2} = \frac{1}{e^2},$$

so that

$$y = \frac{.3989}{e^2}.$$

It has also been seen that e, like π, is a mathematical constant,
having the value 2.7183, so that $e^2 = 7.38906$. Hence, at
$x/\sigma = +2$,

$$y = \frac{.3989}{7.38906} = .05399.$$

This value is also plotted in Fig. 39. Notice that at $x/\sigma = -2$, the
value of y is the same as at $x/\sigma = +2$, for in formula (53) evidently
$(-2)^2$ is the same as $(+2)^2$. The normal curve is thus
symmetrical, i.e., of the same shape, on each side of the mean.
From this it follows that the mean and the median coincide.

* See any text in elementary algebra. 
The student is asked to check the values of y found above at
$x/\sigma = 0$ and at $x/\sigma = \pm 2$ against those printed in Appendix
Table 1. All the values in that table are calculated in this way,
and may be used to complete the construction of Fig. 39. Thus,
the height of the ordinates at $\pm 1\sigma$, read from the table, is .2420,
and is so scaled in the figure. After several ordinates have been
drawn, they are connected by a smooth line, to form the curve
shown.
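
The ordinates worked out above can be verified by evaluating equation (52) directly. The Python sketch below is one way of doing so; the function name `normal_ordinate` is hypothetical, and the constants come from Python's standard `math` module rather than from the rounded values used in the text.

```python
from math import exp, pi, sqrt


def normal_ordinate(t: float) -> float:
    """Equation (52): height of the unit normal curve at t = x / sigma."""
    return exp(-t * t / 2) / sqrt(2 * pi)


for t in (0, 1, 2, -2):
    print(f"t = {t:+d}   y = {normal_ordinate(t):.4f}")
# t = +0   y = 0.3989
# t = +1   y = 0.2420
# t = +2   y = 0.0540
# t = -2   y = 0.0540
```

The equal heights at t = +2 and t = −2 show the symmetry of the curve about the mean.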

The tallest ordinate of the normal curve occurs at the mean,
hence the mean, median, and mode all coincide. This appears
from the fact that when $x = 0$, $y = .3989$; whereas, when
$x \gtrless 0$, $y = .3989/e^{t^2/2}$. The latter term is always smaller than
the former, since all positive powers of e are greater than
1 ($e^0 = 1$).

Another characteristic of the normal curve is that it is asymp- 
totic to the X axis, meaning that the curve constantly approaches 
but never touches the X axis as it extends indefinitely in both 
directions from the mean. 

Table 46 shows a hypothetical normal distribution with 
perfectly symmetrical frequencies. The actual frequencies of 
normal tables may depart in various degrees from this sym- 
metrical pattern, because of sampling errors or the use of class 
intervals that do not place the mean of the series exactly at the
center of the distribution.

TABLE 46.—NORMAL DISTRIBUTION OF SCORES ON AN ARMY ATTITUDES TEST
(HYPOTHETICAL DATA)

Scores       Men
(X)          (f)
 0- 4.9        5
 5- 9.9       17
10-14.9       44
15-19.9       92
20-24.9      150
25-29.9      191
30-34.9      191
35-39.9      150
40-44.9       92
45-49.9       44
50-54.9       17
55-59.9        5
Total        998

With the help of the integral calculus, it is possible to find the
proportion of the area under any part of the normal curve, i.e.,
between the ordinates erected at any two points on the x scale.
This has been done for the areas between the ordinate at the
mean and ordinates erected at intervals of $.01\sigma$ along the x-axis.
The results are shown in Appendix Table 1, in the column
headed “Area.” Thus the area under the curve between the
ordinate at the mean and the ordinate at $1\sigma$ is seen to be 0.34, or
34 per cent (roughly one-third) of the total area under the curve.
In Chap. VI we saw that the area under a frequency histogram,
where the width of the interval is taken as one unit, is equal to
the total frequency of the distribution. The same principle holds
for the normal curve.

Since the normal curve represents the distribution of frequencies
in any normal universe, the proportion of the area between
the ordinate at the mean and the ordinates at, say, $x = \pm 1\sigma$
represents the most probable proportion of the frequencies of
any random sample drawn from such a universe that may be
expected to fall between the values $x = 0$ and $x = \pm 1\sigma$.
Differently expressed, the proportion of the area between the
ordinate at the mean and the ordinates at $x = \pm 1\sigma$ is the
probability that a random sample value of X will fall between $M_x$ and
$\pm 1\sigma$. We see from Appendix Table 1 that this probability is
twice 0.34, which is approximately 0.68, or 68 per cent. It
should now be clear why in a normal distribution the odds are
about two to one that a random value of X will be within a
range of one standard deviation on each side of the mean value of
X. Also, inasmuch as a value of X falls outside the range of
$M_x \pm 2\sigma$ by chance only $1.00 - (2 \times 0.477) = 0.046$, or about one
time in 20, we shall be fairly safe if we attribute those values that
do so to something else than chance. In other words, we shall
arbitrarily regard all such extreme values as significant.
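
The areas quoted from Appendix Table 1 can be reproduced without the table. The sketch below is only an illustration: it assumes the standard error function `math.erf` of Python as a stand-in for the tabled integral, and the name `area_from_mean` is ours.

```python
import math

def area_from_mean(z):
    """Proportion of the total area lying between the mean and z = x/sigma."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

one_sigma = area_from_mean(1.0)              # about 0.34
within_one = 2.0 * one_sigma                 # about 0.68
outside_two = 1.0 - 2.0 * area_from_mean(2.0)  # about one time in 20

print(f"mean to 1 sigma:        {one_sigma:.4f}")
print(f"within +/- 1 sigma:     {within_one:.4f}")
print(f"outside +/- 2 sigma:    {outside_two:.4f}")
print(f"odds of falling within: {within_one / (1.0 - within_one):.2f} to 1")
```

Carried to more decimals than the table, the area outside two standard deviations comes to 0.0455 rather than 0.046; the difference is one of rounding only.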

Reading again from Appendix Table 1, it is seen that approximately
25 per cent of the area of the normal curve lies between
the mean and an ordinate at $x = 0.67\sigma$. That is, one-half of the
area of the curve is included between an ordinate at $-0.67\sigma$ and
an ordinate at $+0.67\sigma$. From a finer table, the figure is found
more exactly to be $0.6745\sigma$. The distance $0.6745\sigma$ from the
mean along the x-axis of the normal curve is commonly called the
probable error (P.E.), and is often used instead of the standard
deviation, $\sigma$, or standard error, as it is called in sampling theory
(see Chap. XII).
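
The figure 0.6745 can be recovered by searching for the point at which the area from the mean equals one-quarter of the whole. The following sketch assumes a simple bisection search; the function names and the tolerance are illustrative choices, not part of the text.

```python
import math

def area_from_mean(z):
    """Proportion of the total area between the mean and z = x/sigma."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

def probable_error(tol=1e-8):
    """Distance from the mean, in sigma units, that encloses 25 per cent of the area."""
    lo, hi = 0.0, 1.0  # the area is 0 at the mean and about 0.34 at 1 sigma
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if area_from_mean(mid) < 0.25:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(f"P.E. = {probable_error():.4f} sigma")
```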

The relationships of the preceding paragraphs do not hold,
however, for skewed distributions. This may be seen from Fig.
41. By comparing the rectangles in the areas $M - 1\sigma$ and
$M + 1\sigma$, it is clear that in this case a much larger proportion
of the area of the curve is contained between M and $M - 1\sigma$ than
between M and $M + 1\sigma$, so that the standard deviation has no constant
relation to the area or frequency. For this reason, the standard
deviation has a variable meaning when applied to asymmetrical
distributions, and should be cautiously interpreted in such cases.


Fig. 40.—Relation between standard deviation and area under normal curve.

Fig. 41.—Relation between standard deviation and area under skewed curve.

In a normal distribution, A.D. = $.80\sigma$, so that the distance
$M \pm$ A.D. on the scale includes about 58 per cent of the
frequencies (see Appendix Table 1).

When n is large, the labor of expanding the binomial becomes
excessive. Under these conditions, if the value of np or nq is not
too small, say 5 or more, the binomial so closely approximates the
normal curve that the latter may be used in its stead for purposes
of estimation, and the desired probabilities simply read from
Appendix Table 1.

Consider again the probability of getting eight or more heads
or tails in a toss of only 10 pennies. In Fig. 39 we erect a
perpendicular at the point

$$\frac{x}{\sigma} = \frac{X - M}{\sigma} = \frac{X - np}{\sqrt{npq}} = \frac{8 - 10(.5)}{\sqrt{10(.5)(.5)}} = 1.90$$

and find the area between this ordinate and the ordinate at the
mean. Entering Appendix Table 1 with $x/\sigma = 1.90$, we see
that the area desired is 0.4713 of the area of the whole curve.
Subtracting 0.4713 from 0.5000, the area of half the curve, we
get 0.0287 as the area to the right of the ordinate at $x/\sigma = +1.9$.
Since the normal curve is assumed to represent the results of all
possible tosses of 10 pennies, the area to the right of $x/\sigma = +1.9$
shows the proportion of tosses that in the long run may be
expected to give eight or more heads. This proportion is the
probability of getting eight or more heads in one toss of 10 pennies,
so the probability of getting eight or more heads or eight or more
tails is twice this, or P = 2(0.0287) = 0.0574. The true value
of P as found above from the binomial expansion is P = 0.1094.
The agreement is thus seen to be none too good when n is as small
as 10. If n is increased to 15, however, np = 7.5, and we find
more agreement. The probability of getting say 12 or more
heads or tails is 0.0204 according to the normal curve, and 0.0176
according to the binomial,¹ the error being only 0.0028. For
larger values of n, the two estimates may for most purposes be
accepted as equivalent.

The approximate probability of getting exactly eight heads or
eight tails in a toss of n = 10 pennies is the height of the ordinate
of the normal curve at the point X = 8, expressed in standard
deviation units. This is because the number 8 is represented
on the X scale by a point rather than by a distance, and on this
point can be erected only a straight line, or ordinate, which
theoretically has no width and hence no area. We now need
the ordinate at

$$\frac{x}{\sigma} = \frac{8 - 10(0.5)}{\sqrt{10(0.5)(0.5)}} = 1.90.$$

From Appendix Table 1 we find y = 0.0656; so that

$$\frac{y}{\sigma} = \frac{0.0656}{1.58} = 0.0415,$$

and $2 \times 0.0415 = 0.083$ is the probability desired. The correct
probability already found by the binomial is 0.0879.

¹ ${}_{15}C_{12}p^{12}q^{3} + {}_{15}C_{13}p^{13}q^{2} + {}_{15}C_{14}p^{14}q + p^{15} = (\tfrac{1}{2})^{15}(455 + 105 + 15 + 1) = 0.01758.$

If we choose to consider the normal curve merely as a device for
approximating the probabilities of the binomial, rather than as a
continuous mathematical distribution, it becomes possible to
take certain liberties with it that will improve its accuracy for the
purpose. For example, to determine the probability of throwing
eight or more heads or tails in a toss of 10 pennies, we may allow
the value X = 8 to occupy the area under the normal curve
between the X values 7.5 and 8.5, and regard the area to the
right of 7.5 as representing the probability of throwing eight or
more heads. We may then erect a perpendicular in Fig. 39
at the point

$$\frac{X - np}{\sqrt{npq}} = \frac{7.5 - 10(0.5)}{\sqrt{10(0.5)(0.5)}} = 1.58$$

and find the area between this ordinate and the ordinate at the
mean to be 0.4429 (Appendix Table 1). The area to the right
of the ordinate at $1.58\sigma$ is, therefore, 0.5000 − 0.4429 = 0.0571,
which is the probability of throwing eight or more heads. The
probability of throwing eight or more heads or eight or more
tails is $2 \times 0.0571 = 0.1142$. This result is much closer to the
correct binomial probability of 0.1094 than was that obtained
above in the orthodox way. Indeed, the accuracy of the normal
curve in approximating the binomial has now been made quite
satisfactory even for n = 10.
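
The three figures for eight or more heads or tails in 10 pennies can be set side by side by direct computation. This is a sketch under stated assumptions: `math.erfc` stands in for Appendix Table 1, the helper names are ours, and $x/\sigma$ is not rounded to two decimals as it is when the table is entered, so the normal-curve results differ slightly in the fourth decimal from those in the text.

```python
import math

def upper_tail(z):
    """Area of the normal curve to the right of z = x/sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def binomial_upper(n, k, p=0.5):
    """Exact probability of k or more successes in n trials."""
    q = 1.0 - p
    return sum(math.comb(n, r) * p**r * q**(n - r) for r in range(k, n + 1))

n, p, k = 10, 0.5, 8
mean = n * p
sigma = math.sqrt(n * p * (1.0 - p))

exact = 2.0 * binomial_upper(n, k, p)                     # heads or tails
plain = 2.0 * upper_tail((k - mean) / sigma)              # ordinate erected at X = 8
corrected = 2.0 * upper_tail((k - 0.5 - mean) / sigma)    # ordinate erected at X = 7.5

print(f"binomial:                 {exact:.4f}")
print(f"normal, at X = 8:         {plain:.4f}")
print(f"normal, at X = 7.5:       {corrected:.4f}")
```

Moving the ordinate from 8 to 7.5 is the only change between the last two lines, and it accounts for nearly all of the improvement.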

It is also possible to use a similar manipulation in estimating 
the probability of throwing exactly eight heads or eight tails in a 
toss of 10 pennies. We find from Appendix Table 1 the area 
under the curve included between an ordinate at X = 7.5 and 
an ordinate at X = 8.5. The table gives 0.4864 as the area
between the mean ordinate and the ordinate at X = 8.5, i.e., at

$$x = \frac{8.5 - 10(0.5)}{\sqrt{10(0.5)(0.5)}}\,\sigma = 2.21\sigma,$$

and 0.4429 as the area between the mean ordinate and the
ordinate at X = 7.5, i.e., at

$$x = \frac{7.5 - 10(0.5)}{\sqrt{10(0.5)(0.5)}}\,\sigma = 1.58\sigma.$$

Consequently, the area between the ordinate at X = 7.5 and
the ordinate at X = 8.5 is

$$0.4864 - 0.4429 = 0.0435.$$

This is the probability of throwing exactly eight heads in a toss
of 10 pennies; so the probability of throwing exactly eight heads
or exactly eight tails is $2 \times 0.0435 = 0.0870$. The error from
the binomial (0.0879) in this case is negligible.
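
The two ways of approximating the probability of exactly eight heads or eight tails may be compared with the binomial value as follows. The sketch is illustrative only: the helper names are ours, `math.erf` replaces the table, and $x/\sigma$ is carried unrounded.

```python
import math

def ordinate(z):
    """Height of the normal curve at z = x/sigma."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def area_from_mean(z):
    """Area between the mean and z = x/sigma."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

n, p, k = 10, 0.5, 8
mean = n * p
sigma = math.sqrt(n * p * (1.0 - p))

# Exact binomial probability of exactly 8 heads or exactly 8 tails.
exact = 2.0 * math.comb(n, k) * p**k * (1.0 - p)**(n - k)

# Height of the ordinate at X = 8, expressed in standard deviation units.
by_ordinate = 2.0 * ordinate((k - mean) / sigma) / sigma

# Area of the band between X = 7.5 and X = 8.5.
by_band = 2.0 * (area_from_mean((k + 0.5 - mean) / sigma)
                 - area_from_mean((k - 0.5 - mean) / sigma))

print(f"binomial: {exact:.4f}  ordinate: {by_ordinate:.4f}  band: {by_band:.4f}")
```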

It should be noted that special modifications like those above
in the use of the normal curve are usually worth while only when
np is small, say np < 6.

The distributions with which social scientists have to deal usually
depart considerably from the normal form. Such a distribution is shown
in Table 47 and in Fig. 42. It is readily seen to extend farther
in the positive direction from the mean than in the negative
direction, and so is said to be positively skewed. If there is
occasion to measure the amount of the skewness, an index is
provided by formula (54):¹

$$Sk = \frac{3(M - Md)}{\sigma}. \tag{54}$$

Fig. 42.—Skewed distribution.

TABLE 47.—RELATIVE NUMBERS OF DIVORCED COUPLES BY YEARS MARRIED

Years          Divorced      Accumulated
married (X)    couples (f)   frequency
0-0.9              15             15
1.0-1.9            72             87
2.0-2.9            60            147
3.0-3.9            43            190
4.0-4.9            21            211
5.0-5.9            17            228
6.0-6.9             9            237
7.0-7.9             8            245
8.0-8.9             5            250
9.0-9.9             2            252
Total             252


¹ We saw in Chap. VII that the value of the mean, M, is influenced by
extreme values, and hence by skewness, but that the value of the mode, Mo,
is not affected. In the present chapter it was learned that in a normal
distribution the mean, mode, and median all have the same value. These
facts suggest as an approximate measure of absolute skewness, Sk, the
difference

$$Sk = M - Mo. \tag{55}$$

To change this to generalized units, we may write

$$Sk = \frac{M - Mo}{\sigma}. \tag{56}$$

Because the value of the mode can seldom be accurately determined,
however, it is considered preferable to replace it by its equivalent in terms of
the median, Md. In any moderately skewed distribution, the median falls
about two-thirds of the distance from the mode to the mean (see Chap. VII,
Fig. 31). We therefore have

$$Mo = M - 3(M - Md). \tag{57}$$

Substituting this value of the mode in formula (56),

$$Sk = \frac{M - [M - 3(M - Md)]}{\sigma},$$
$$Sk = \frac{3(M - Md)}{\sigma}. \tag{54}$$

The values of Sk by this formula vary between $\pm 3$, but values
greater than $\pm 1$ do not often occur. If there is no skewness,
Sk = 0.

A more useful measure of skewness for some purposes is $g_1$,
which for large samples is approximately¹

$$g_1 = \frac{\nu_3}{\sigma^3}. \tag{58}$$

$\nu_3$ is the third moment about the mean of the distribution, defined
by the equation $\nu_3 = \Sigma fx^3/N$, where x is a mean deviate as
usual.

For a normal distribution $g_1 = 0$. For other values of $g_1$ the
sign indicates the direction of the skewness. Values of $g_1$ as
great as $\pm 2$ mean decided skewness.

A frequency distribution may also depart from the normal in
height or "peakedness." This is called kurtosis. If the observed
distribution is flatter than the normal, it is said to be platykurtic;
if more peaked, leptokurtic; if neither, mesokurtic. Kurtosis may
be measured by $g_2$. For large samples, an approximate formula is

$$g_2 = \frac{\nu_4}{\sigma^4} - 3. \tag{59}$$

$\nu_4$ is the fourth moment, $\Sigma fx^4/N$, of the distribution; and $\sigma^4$ is
the second moment, $\nu_2 = \sigma^2 = \Sigma fx^2/N$, squared.
$g_2$ also is zero for a normal distribution. A positive value of
$g_2$ indicates that the observed distribution is more peaked than
the normal, and a negative value indicates that it is flatter.

¹ $\nu$ is the lower-case Greek letter nu.

Before formula (58) or (59) can conveniently be applied to a
distribution like that of Table 47, some short-cut calculating
formulas are needed:

$$\nu_2 = \sigma^2 = i^2\left[\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2\right], \tag{60}$$
$$\nu_3 = \frac{i^3}{N}\left[\Sigma fd^3 - \frac{3}{N}\,\Sigma fd\,\Sigma fd^2 + \frac{2}{N^2}(\Sigma fd)^3\right], \tag{61}$$
$$\nu_4 = \frac{i^4}{N}\left[\Sigma fd^4 - \frac{4}{N}\,\Sigma fd^3\,\Sigma fd + \frac{6}{N^2}\,\Sigma fd^2(\Sigma fd)^2 - \frac{3}{N^3}(\Sigma fd)^4\right], \tag{62}$$

where i = width of class interval.
d = unit step deviation from an assumed mean.
N = $\Sigma f$.

Notice that formulas (61) and (62) are merely extensions of the
familiar short method of finding a standard deviation by the use
of an assumed mean and unit step intervals. This appears
clearly in Table 48, below.

Let us now measure the skewness and kurtosis of the distribution
shown in Table 47, by comparing it with the normal curve.
We set up the computing table:

TABLE 48.—COMPUTING TABLE FOR MOMENTS: DATA OF TABLE 47

Years married     f     d     fd     fd^2     fd^3      fd^4
0-0.9            15    -2    -30       60     -120       240
1.0-1.9          72    -1    -72       72      -72        72
2.0-2.9          60     0      0        0        0         0
3.0-3.9          43     1     43       43       43        43
4.0-4.9          21     2     42       84      168       336
5.0-5.9          17     3     51      153      459     1,377
6.0-6.9           9     4     36      144      576     2,304
7.0-7.9           8     5     40      200    1,000     5,000
8.0-8.9           5     6     30      180    1,080     6,480
9.0-9.9           2     7     14       98      686     4,802
Total           252          154    1,034    3,820    20,654

Recalling the short formula for the mean,

$$M = A + \frac{\Sigma fd}{N}\,i,$$

where A is the assumed mean, we find for this table,

$$M = 2.5 + \frac{154}{252}(1) = 3.11.$$

Substituting in the formula for the median,

$$Md = L + \left(\frac{\frac{N}{2} - F}{f}\right)i,$$

where the symbols have the meanings explained in Chap. VII.
We find

$$Md = 2.0 + \frac{126 - 87}{60}(1) = 2.65.$$

For the standard deviation, we have

$$\sigma = i\sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2},$$
$$\sigma = 1\sqrt{\frac{1034}{252} - \left(\frac{154}{252}\right)^2},$$
$$\sigma = 1.93.$$

Hence, according to formula (54), we find the skewness to be

$$Sk = \frac{3(3.11 - 2.65)}{1.93} = 0.72.$$

This shows considerable skewness in the positive direction.

Let us next measure the amount of skewness in Table 48 by
the use of formula (58). From formulas (60) and (61) we find

$$\nu_2 = \sigma^2 = (1.93)^2 = 3.73,$$
$$\nu_3 = \frac{1}{252}\left[3820 - \frac{3}{252}(154)(1034) + \frac{2}{(252)^2}(154)^3\right] = 8.09.$$

Substituting in formula (58),

$$g_1 = \frac{8.09}{(1.93)^3} = 1.13.$$

This result agrees with that obtained by formula (54), in
showing positive skewness.

We shall now measure the degree of kurtosis, if any, exhibited
by the distribution of Table 48, through the use of formula (59).
We need only one new value, $\nu_4$, which may be found by formula
(62),

$$\nu_4 = \frac{1}{252}\left[20{,}654 - \frac{4}{252}(3{,}820)(154) + \frac{6}{(252)^2}(1{,}034)(154)^2 - \frac{3}{(252)^3}(154)^4\right] = 53.70.$$

Substituting in formula (59),

$$g_2 = \frac{53.70}{(3.73)^2} - 3.00,$$
$$g_2 = 3.86 - 3.00 = 0.86.$$

The value of $g_2$ is positive, so we conclude that the observed
distribution is leptokurtic, or more peaked than a normal curve.¹
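
The whole computation for Table 47 can be retraced mechanically from the frequencies. The sketch below assumes, as the worked figures above imply, an assumed mean of 2.5 in the class 2.0-2.9 and a class interval of 1; the variable names are ours. Because nothing is rounded along the way, the last decimal of some results may differ slightly from the hand computation.

```python
import math

freqs = [15, 72, 60, 43, 21, 17, 9, 8, 5, 2]  # Table 47, classes 0-0.9 through 9.0-9.9
i = 1.0                                       # width of class interval
A = 2.5                                       # assumed mean (class 2.0-2.9)
steps = [k - 2 for k in range(len(freqs))]    # unit step deviations d: -2, -1, 0, ...

N = sum(freqs)
s1 = sum(f * d for f, d in zip(freqs, steps))       # sum of fd
s2 = sum(f * d**2 for f, d in zip(freqs, steps))    # sum of fd^2
s3 = sum(f * d**3 for f, d in zip(freqs, steps))    # sum of fd^3
s4 = sum(f * d**4 for f, d in zip(freqs, steps))    # sum of fd^4

mean = A + s1 / N * i
nu2 = i**2 * (s2 / N - (s1 / N) ** 2)                                        # formula (60)
nu3 = i**3 / N * (s3 - 3 / N * s1 * s2 + 2 / N**2 * s1**3)                   # formula (61)
nu4 = i**4 / N * (s4 - 4 / N * s3 * s1 + 6 / N**2 * s2 * s1**2 - 3 / N**3 * s1**4)  # (62)
sigma = math.sqrt(nu2)

# Median by interpolation in class 2.0-2.9: 87 cases lie below it, 60 within it.
median = 2.0 + (N / 2 - 87) / 60 * i

sk = 3 * (mean - median) / sigma   # formula (54)
g1 = nu3 / sigma**3                # formula (58)
g2 = nu4 / sigma**4 - 3            # formula (59)

print(f"M = {mean:.2f}  Md = {median:.2f}  sigma = {sigma:.2f}")
print(f"Sk = {sk:.2f}  g1 = {g1:.2f}  g2 = {g2:.2f}")
```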

Even though a sample distribution is found by the above 
methods to differ from the normal, the question arises whether or 
not the difference is one that might be due merely to random 
errors of sampling. This point is dealt with in Chap. XIII. 


Exercises 


1. Twelve children are to be used in the experimental study of domi- 
nating and submissive types of behavior. Each child is to be grouped 
(a) with one other child, (b) with two other children. What is the total 
possible number of such experimental groups of each size? 

2. Four villages, five cities, and five rural counties are to be grouped 
in all possible combinations of five. No distinction is made between
areas of the same type, i.e., one village is the equivalent of another 
village. What is the total number of combinations? Describe them. 

3. The types of contact between families in a community are listed 
as: visit, church, lodge, school, business, and "other." But any or all 
of these contacts may appear together, as well as separately. How 
many combinations of all kinds are there between these several types 
of contact?

4. The educational levels of a sample of husbands and wives are 
recorded as college, high school, grades, and illiterate. What is the 
total number of possible permutations of husband-wife relationships 
in terms of these levels, and what are they? 

5. How many marriages are possible between three pairs of brothers 
and sisters in our society? 


¹ Another measure of kurtosis that is more commonly used than $g_2$ is $\beta_2$:

$$\beta_2 = \frac{\nu_4}{\sigma^4}. \tag{63}$$

For Table 48, above, $\beta_2 = 3.86$. Since in a normal distribution $\beta_2 = 3$,
the observed distribution is again seen to be leptokurtic.

6. In an experiment with four pairs of subjects, each pair consists of
a male and a female, closely “matched” in respect to certain sociological
characteristics. They are to be given a test while seated around a table
in such a way that the sexes alternate, and no members of a matched
pair sit next to each other. In how many ways may this be done?

NOTE: The number of different permutations of n things taken n at
a time when arranged in a circle is given by the formula (n − 1)!

7. Gist and Clark give the following table: 


DISTRIBUTION OF INTELLIGENCE SCORES OF 2,544 (KANSAS) RURAL HIGH-
SCHOOL STUDENTS IN 1923, ACCORDING TO PRESENT RURAL AND
URBAN CLASSIFICATION*

I.Q.          Urban     Rural     Total
                378       832     1,210
                326       472       798
                260       276       536
Total           964     1,580     2,544

* American Journal of Sociology, July, 1938, p. 43.

Compare the observed frequencies with those expected by chance alone,
apply the $\chi^2$ test, and comment on the results.

8. Classification of many cases shows that the probability of а mar- 
riage ending in divorce under certain conditions is 0.20. In a sample of 
20 such marriages, what is the probability that there will be no divorce? 
What is the probability that there will be no more than two divorces? 
Compare the results from the binomial with those from the normal
curve.

9. In Exercise 8, if many random samples of 20 marriages each were
taken from the type of marriage referred to, (a) What mean number
of marriages per sample would be expected to end in divorce? (b) What
would be the standard deviation of the numbers of marriages ending in
divorce found from many samples?

10. Calculate skewness and kurtosis for the distributions below:


FAILURES ON PAROLE IN 50 SUBSAMPLES OF FIVE PRISONERS EACH

Failures     Frequency
   0              1
   1             10
   2             17
   3             15
   4              7
   5              0

FAMILIES BY SIZE

Persons     Frequencies
   1             24
   2             70
   3             62
   4             52
   5             36
   6             23
   7             14
   8              8
   9              5
  10              3
  11              1
  12              1
Total           299

FAMILIES CLASSIFIED BY AGE OF MALE HEAD

Age, years       Frequency
Under 25             13
25-34
35-44
45-54
55-64
65-74
75 and over
Total

CHAPTER X


GROSS RELATIONSHIP BETWEEN TWO FACTORS: 
SIMPLE LINEAR QUANTITATIVE CORRELATION 


One of the most common purposes of social research is to dis- 
cover whether or not there is any relationship between two 
factors, and to measure the amount of the relationship. For 
example, does the number of children in a family tend to decrease 
as the family income increases? If treated statistically, this 
kind of question is called a problem in correlation. As will be
seen below, statistics is able to measure the amount of relation- 
ship (correlation) present in such cases, to provide an equation 
by which one of the factors can be predicted from a knowledge 
of the other, and to estimate the range of error in the predictions. 

1. The Scatter Diagram: Ungrouped Data.—As an introduc- 
tion to the method of simple linear correlation applied to un- 
grouped data, let us test the idea that the largest percentage 
increases of population in the United States between 1920 and 
1930 occurred in regions where the density of population per 
square mile was least in 1920. We shall limit ourselves here to 
examining the amount of correlation in the nine census divisions. 
The necessary figures are given in Table 49. 

TABLE 49.—PERCENTAGE OF POPULATION INCREASE, 1920-1930 (Y), IN
RELATION TO POPULATION PER SQUARE MILE IN 1920 (X), BY
GEOGRAPHIC DIVISIONS, UNITED STATES*

Division                   X       Y
New England              119      10
Middle Atlantic          223      18
East North Central        88      18
West North Central        25       6
South Atlantic            52      13
East South Central        50      11
West South Central        24      19
Mountain                   4      11
Pacific                   18      47

* From Abstract of the Fifteenth Census of the United States, 1930, pp. 12-13.

We may make a preliminary judgment by rough methods as to
whether or not any relationship is present between the X and Y 
series. Taking the four largest values of X, we find the average 
of the four corresponding Y values to be 14.75. For the four 
smallest values of X, the average Y value is 20.75. In other 
words, as the X values decrease, the Y values tend to increase, 
on the average. This suggests that there is some negative 
relationship between the two series. 
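
The rough comparison just described amounts to ranking the divisions by density and averaging Y at each end of the ranking. A minimal sketch follows; the list layout and variable names are ours, and the Y value entered for East South Central (11) is an assumption that enters neither average.

```python
# Table 49: (division, population per square mile in 1920, per cent increase 1920-1930)
divisions = [
    ("New England", 119, 10),
    ("Middle Atlantic", 223, 18),
    ("East North Central", 88, 18),
    ("West North Central", 25, 6),
    ("South Atlantic", 52, 13),
    ("East South Central", 50, 11),   # Y assumed; not used in either average below
    ("West South Central", 24, 19),
    ("Mountain", 4, 11),
    ("Pacific", 18, 47),
]

# Rank from densest to sparsest, then average Y over the top and bottom four.
by_density = sorted(divisions, key=lambda row: row[1], reverse=True)
mean_y_densest = sum(row[2] for row in by_density[:4]) / 4
mean_y_sparsest = sum(row[2] for row in by_density[-4:]) / 4

print(mean_y_densest, mean_y_sparsest)
```

Dropping the Pacific division from the four sparsest lowers their average from 20.75 to 12, which anticipates the point made from the scatter diagram.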

A better way of prejudging correlation is by means of a
scatter diagram. The X and Y values are plotted on rectangular
coordinate paper, as shown in Fig. 43.¹ It is now seen that if
the point for the Pacific region is omitted, the remaining points
show no discernible tendency either to rise or to fall across the
table. Any correlation present must, therefore, be due to a
single case. It would be misleading to say that between 1920
and 1930 there was a tendency for population in the United
States to increase at a faster rate in thinly populated regions than
in thickly populated regions, when as a matter of fact this was
true in only one out of nine regions. There is accordingly no
point in going any further with this problem, unless we wish to
try areas smaller than census divisions.

Fig. 43.—Scatter diagram for Table 49.

¹ For example, the first pair of values constitute a point with the
coordinates (119, 10). To plot this point in Fig. 43, after drawing the horizontal X
axis and the Y axis perpendicular to it, we measure 119 units from the origin
at 0 along the X axis, then up 10 Y units parallel to the Y axis, and there
mark in the point.

Consider a second problem. Do the counties of Wisconsin
that have high birth rates also tend to have high death rates?
Waiving the objections that a county is not always a homogeneous
unit (e.g., a county may be half urban and half rural)
and that its population is often too small to yield reliable birth
and death rates, let us compare the first 20 counties of the state,
taken alphabetically, in 1935. The data are in Table 50.

TABLE 50.—BIRTH AND DEATH RATES BY COUNTIES IN WISCONSIN, 1935*

               Birth      Death
County         rate (X)   rate (Y)      XY        X^2       Y^2
Adams            18.6        9.7      180.42    345.96     94.09
Ashland          22.2       12.0      266.40    492.84    144.00
Barron           18.4       10.4      191.36    338.56    108.16
Bayfield         12.5        8.3      103.75    156.25     68.89
Brown            22.1       11.6      256.36    488.41    134.56
Buffalo          17.5        6.9      120.75    306.25     47.61
Burnett          17.2       10.3      177.16    295.84    106.09
Calumet          15.7        6.8      106.76    246.49     46.24
Chippewa         20.5       12.1      248.05    420.25    146.41
Clark            17.3        7.4      128.02    299.29     54.76
Columbia         17.4       13.9      241.86    302.76    193.21
Crawford         22.5       10.1      227.25    506.25    102.01
Dane             17.1       13.8      235.98    292.41    190.44
Dodge            14.4        9.2      132.48    207.36     84.64
Door             20.8        9.8      203.84    432.64     96.04
Douglas          16.2       12.2      197.64    262.44    148.84
Dunn             18.7        9.3      173.91    349.69     86.49
Eau Claire       22.0       12.2      268.40    484.00    148.84
Florence         17.8       10.5      186.90    316.84    110.25
Fond du Lac      17.3       11.1      192.03    299.29    123.21
Total           366.2      207.6     3839.32   6843.82   2234.78

$$M_x = \frac{366.2}{20} = 18.31 \qquad M_y = \frac{207.6}{20} = 10.38$$

We shall apply the device of the scatter diagram to these 
figures. The results are shown in Fig. 44. 

From Fig. 44, we notice first that the range taken by the
points is limited, none falling below 12 or above 23 on the X
scale, and none below 6 or above 14 on the Y scale. It is a
general precaution that as a rule any correlation found for a
given set of data should not be assumed to exist outside the
range of the data. A man may accept a wage of 50 cents an
hour to work eight hours or perhaps even 12 hours without
resting, but it would be erroneous to suppose from this that he
would continue to work an indefinite number of hours at that
rate. After 12 or 14 hours, it would probably require more than
50 cents to induce him to work another hour. Thus the relationship
between wages (X) and length of work period (Y) would
not be the same beyond the range of 12 hours as within that
range. Similarly, counties with birth rates much below 12 or
above 23 might show death rates entirely out of line with what
would be expected from the relationship found between birth
and death rates in the counties included in the study.

Fig. 44.—Scatter diagram for Table 50. (Line shown: regression of Y on X.)

A second fact shown by Fig. 44 is that there is a general 
tendency for the points to rise in the positive direction along the 
X scale. That is, as the birth rates in the counties increase, the 
death rates tend to increase also. This indicates that there is
some positive correlation between the two kinds of rates that
seems worthy of further investigation. We would not expect a
high correlation, however, because the dots show considerable
scatter, instead of following one another in a continuous line
or curve.

It should be pointed out that if the data in Fig. 44 had fallen
instead of rising in the positive direction along the X scale, a
negative relationship would have been indicated. That is,
there would have been a tendency for the death rates to decline
as the birth rates increased. A negative correlation, of course,
shows just as much relationship as a positive correlation of the
same degree.

2. The Line of Regression: Ungrouped Data.—In simple cor- 
relation it is customary, whenever reasonable, to regard one of 
the factors, X, as an independent factor, and the other, Y, as a 
dependent factor. Thus, above, the birth rate is taken as the 
independent factor, X, and the death rate as the dependent 
factor, Y, because the birth rate is believed to influence the 
death rate, rather than vice versa. 

Returning to Fig. 44, the next step in the attempt to measure 
the amount of correlation between the X and Y factors is to 
ask what is the form of the observed correlation. From inspec- 
tion of the figure, it appears that the simplest way to represent 
the relationship is by means of a straight line. This is fortunate, 
because the method of simple correlation that is described in 
this chapter deals only with straight-line, or linear, relationships. 
Relationships that take the form of curved lines are measured 
by other methods. When it seems advisable to use a formal
mathematical test to determine whether or not a relationship is linear,
the description of such a test may be found in more advanced texts.¹

Although, of course, no one line will fit all the points in Fig. 44,
mathematics furnishes a formula for determining the line of best fit,
which is usually called the line of regression of Y on X. The general
equation of a straight line is

$$Y_c = a_{yx} + b_{yx}X, \tag{65}$$

where a is the intercept of the line on the Y axis, and b is the slope
of the line with respect to the X axis, or the ratio of c to d in
Fig. 45. (This follows from the argument that at any point, P,
on the line, Y = a + c; but by definition b = c/X, or c = bX;
therefore, Y = a + bX.)

Fig. 45.—Geometric meaning of the equation of a straight line.

To determine the values of the constants, a and b, that will
give the line of best fit, the following normal equations are used:*

$$b_{yx} = \frac{\Sigma XY - NM_xM_y}{\Sigma X^2 - NM_x^2} = \frac{N\Sigma XY - \Sigma X\,\Sigma Y}{N\Sigma X^2 - (\Sigma X)^2} = \frac{\Sigma xy}{\Sigma x^2}, \tag{66}$$
$$a_{yx} = M_y - b_{yx}M_x, \tag{67}$$

where the subscripts yx indicate the regression of Y on X.

¹ G. U. Yule and M. G. Kendall, An Introduction to the Theory of
Statistics, pp. 455-456, Charles Griffin & Company, Ltd., London, 1937.
From Table 50, we substitute in formula (66):

$$b_{yx} = \frac{3839.32 - 20(18.31)(10.38)}{6843.82 - 20(18.31)^2},$$

$b_{yx} = .27516$,†

$$a_{yx} = 10.38 - .27516(18.31) = 5.34182.$$

Substituting these values of a and b in formula (65),

$$Y_c = 5.3418 + .27516X. \tag{68}$$

Putting X = 12.5 in formula (68), we have

$$Y_c = 5.3418 + .27516(12.5),$$
$$Y_c = 8.78130.$$

Letting X = 22,

$$Y_c = 11.39534.$$

Plotting these two calculated points, (12.5, 8.78) and (22,
11.395), in Fig. 44, we get the line of regression of Y on X there
shown. If X = 0, $Y_c = 5.34 = a$.
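
The constants of the regression line can be checked from the paired rates of Table 50. The sketch below uses the second form of formula (66), which avoids rounding the means; the variable names and the helper `predict` are ours, and the last three birth rates (22.0, 17.8, 17.3) are taken as recovered from the squares in Table 50 and the calculated values of Table 51.

```python
# Table 50: birth rates (X) and death rates (Y) for the 20 counties, in table order.
X = [18.6, 22.2, 18.4, 12.5, 22.1, 17.5, 17.2, 15.7, 20.5, 17.3,
     17.4, 22.5, 17.1, 14.4, 20.8, 16.2, 18.7, 22.0, 17.8, 17.3]
Y = [9.7, 12.0, 10.4, 8.3, 11.6, 6.9, 10.3, 6.8, 12.1, 7.4,
     13.9, 10.1, 13.8, 9.2, 9.8, 12.2, 9.3, 12.2, 10.5, 11.1]

N = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_xx = sum(x * x for x in X)

# Formula (66), second form, and formula (67).
b = (N * sum_xy - sum_x * sum_y) / (N * sum_xx - sum_x ** 2)
a = sum_y / N - b * sum_x / N

def predict(x):
    """Calculated death rate Y_c for a birth rate x, by formula (65)."""
    return a + b * x

print(f"b = {b:.5f}   a = {a:.5f}")
print(f"Y_c at X = 12.5: {predict(12.5):.3f}   Y_c at X = 22: {predict(22):.3f}")
```

Setting x equal to the mean birth rate returns the mean death rate, so the line passes through the point of the two means.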

If the origin is shifted to the means of the two series‡ (Fig. 46),
equation (65) becomes

$$y_c = b_{yx}x, \tag{69}$$

where x and y are deviates from their respective means. For the
present problem, this gives

$$y_c = .27516x, \tag{70}$$

which is a simpler equation and often easier to handle than
equation (68). The $y_c$ values calculated from this equation,
however, are of course not directly comparable with the observed
Y's. For that reason equation (68) is used to provide the values
in Table 51.

Fig. 46.—Shift of axes necessary to change regression line to mean deviate form
($y_c = b_{yx}x$).

* Also, see formula (88).
† These figures are carried to several decimal places to provide a check
in the summation of the third column of Table 51. If the work has been
correctly done, this column will sum approximately to zero.
‡ Notice that the mean of the $Y_c$ values calculated from the regression
equation is equal to the mean of the observed Y values. This may be shown
algebraically by replacing a in equation (65), above, with its equivalent
from equation (67):

$$Y_c = a_{yx} + b_{yx}X,$$
$$Y_c = M_y - b_{yx}M_x + b_{yx}X,$$
$$\frac{\Sigma Y_c}{N} = M_y - b_{yx}M_x + b_{yx}M_x,$$
$$\frac{\Sigma Y_c}{N} = M_y.$$

If the second equation above is expressed in terms of mean deviates, we get

$$(Y_c - M_y) = (M_y - M_y) - b_{yx}(M_x - M_x) + b_{yx}(X - M_x);$$
$$(Y_c - M_y) = b_{yx}(X - M_x)$$

or

$$y_c = b_{yx}x.$$

Subtracting $M_y$ from each Y value and $M_x$ from each X value in equation
(65) is equivalent to measuring all Y values from the mean of the Y's, and
all X values from the mean of the X's. That is, the Y axis in Fig. 46 is
simply moved to the right to the mean of the X's, and the X axis is moved
up a distance equal to the mean of the Y's. This, of course, places the
intersection of the two new axes at a point which has for its coordinates
the means of the two series ($M_x$, $M_y$). Since this point is the origin of the
system of axes from which all values of X and Y are to be measured, however,
it is convenient to give it the coordinates (0, 0). This is also necessary if we
express x and y in mean deviate form as in equation (69), because at the
point of the means the value of every mean deviate must be zero.

It follows from the second equation above that the regression line always
passes through the point ($M_x$, $M_y$), since, if we let X = $M_x$,

$$Y_c = M_y - b_{yx}M_x + b_{yx}M_x,$$
$$Y_c = M_y.$$

The same fact appears if we let x = 0 in equation (69): $y_c = b_{yx}(0)$, $y_c = 0$.

A measure of the goodness of fit of the regression line
$Y_c = 5.34 + .275X$ to the points in Fig. 44 is given by the
formula for the standard error of estimate, $S_y$:

$$S_y = \sqrt{\frac{\Sigma d^2}{N}}, \tag{71}$$

or

$$S_y = \sqrt{\frac{\Sigma Y^2 - a_{yx}\Sigma Y - b_{yx}\Sigma XY}{N}}, \tag{72}$$

where d is the difference between the observed and the calculated
Y values, and N is the number of paired values. The d’s are
shown in Table 51:

TABLE 51.—VALUES OF d AND d² 

Observed    Calculated 
   (Y)         (Y_c)           d            d² 
   9.7       10.45980      - .75980       .57730 
  12.0       11.45037      + .54963       .30209 
  10.4       10.40476      - .00476       .00002 
   8.3        8.78132      - .48132       .23167 
  11.6       11.42286      + .17714       .03138 
   6.9       10.15712      -3.25712     10.60883 
  10.3       10.07457      + .22543       .05081 
   6.8        9.66183      -2.86183      8.19007 
  12.1       10.98260      +1.11740      1.24858 
   7.4       10.10209      -2.70209      7.30129 
  13.9       10.12960      +3.77040     14.21592 
  10.1       11.53292      -1.43292      2.05326 
  13.8       10.04706      +3.75294     14.08456 
   9.2        9.30412      - .10412       .01084 
   9.8       11.06515      -1.26515      1.60060 
  12.2        9.79941      +2.40059      5.76283 
   9.3       10.48731      -1.18731      1.40971 
  12.2       11.39534      + .80466       .64748 
  10.5       10.23967      + .26033       .06777 
  11.1       10.10209      + .99791       .99582 
 207.6      207.59999        .00001     69.39083 

$$S_y = \sqrt{\frac{69.39}{20}} = 1.86,$$

or, using formula (72), 

$$S_y = \sqrt{\frac{2234.78 - 5.34182(207.6) - .27510(3839.32)}{20}} = 1.86.$$
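
The two routes to the standard error of estimate can be checked with a short Python sketch. It uses only the totals quoted above (the sum of d² from Table 51, the raw sums, and the constants a and b); the function names are illustrative and not part of the text. The two results agree only to within the rounding of a and b, so the second comes out a few thousandths higher than the first.

```python
import math

# Totals quoted in the text and in Table 51 (20 Wisconsin counties).
N = 20
sum_d2 = 69.39083            # sum of squared deviations from regression
sum_y = 207.6                # sum of observed Y
sum_y2 = 2234.78             # sum of Y squared
sum_xy = 3839.32             # sum of XY products
a, b = 5.34182, 0.27510      # constants of the regression equation

def see_from_residuals(sum_d2, n):
    """Formula (71): root of the mean squared deviation from regression."""
    return math.sqrt(sum_d2 / n)

def see_from_sums(sum_y2, sum_y, sum_xy, a, b, n):
    """Formula (72): the same quantity computed from raw sums."""
    return math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / n)

s71 = see_from_residuals(sum_d2, N)
s72 = see_from_sums(sum_y2, sum_y, sum_xy, a, b, N)
print(f"formula (71): {s71:.3f}   formula (72): {s72:.3f}")
```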
The standard error of estimate is like the standard deviation, 
except that in the case of the latter the Y values are subtracted 
from their mean, while in the case of the former they are subtracted 
from the regression line, i.e., from the calculated Y_c's. 
Notice in Table 51 that the deviations from regression add to 
zero, just as do mean deviations. If the distribution of Y values 
is normal, two out of three of the observed Y's will not vary 
from the regression line by more than one standard error of 
estimate on each side. This may be shown graphically by 
plotting in the range ±S_y from the regression line in Fig. 44. 
Adding 1.86 to and subtracting it from Y_c = 8.78 at X = 12.5, and 
then from Y_c = 11.40 at X = 22, gives a range of 
6.92-10.64 at the small end of the scale and a range of 
9.54-13.26 at the large end. Accordingly, only six counties 
(Buffalo, Calumet, Clark, Columbia, Dane, and Douglas) out 
of the 20 are found to fall outside the range ±S_y. Thus 30 per 
cent of the cases exceed the range, compared with 32 per cent 
in a strictly normal distribution. This close agreement is in 
spite of the small number of counties in Table 50. 

There is, of course, seldom any reason for using a regression 
equation to calculate values of У for comparison with the data 
from which the regression equation was obtained. A regression 
equation is rather applied to new data for the purpose of making 
predictions. For example, the usefulness of the regression 
equation (68), based on Table 50, lies in telling us what death rates 
to expect in counties that are not included in the table, or in a 
year other than 1935. 

Even in the prediction of individual Y values when r is low, 
however, it is often possible to reach relatively safe conclusions 
by noting the odds in their favor. For example, the most 
probable value of Y corresponding to an X value of 18.6 was 
found by substituting X = 18.6 in equation (68), giving 
Y_c = 10.46. In other words, if we know that a county had a 
birth rate of 18.6, we can predict that its most probable death 
rate is 10.46, and we can feel some confidence that its actual 
death rate will not usually be below 8.60 or above 12.32 (i.e., 
10.46 ± 1.86). If we wish to be surer, the odds are about 
20 to 1 in a normal distribution that the death rate of this 
county will fall between 10.46 ± (1.86 × 2), i.e., between 6.74 
and 14.18. If practical certainty is required, only once in some 
369 times in a normal distribution will the death rate exceed 
the range of 10.46 ± (1.86 × 3), or 4.88 to 16.04, inclusive. 
The spread of possible error is now large, but the advantage over 
random guessing is still considerable. This is usually true even 
after making allowance for the fact that the distribution is not 
normal, and for errors due to sampling. 
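
The three ranges just described, and the share of a normal distribution each one covers, can be reproduced with a minimal Python sketch. It assumes an exactly normal distribution about the regression line; the helper names `coverage` and `prediction_band` are hypothetical.

```python
import math

def coverage(k):
    """Share of a normal distribution lying within k standard
    deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def prediction_band(y_c, s_y, k):
    """Range y_c ± k standard errors of estimate."""
    return (y_c - k * s_y, y_c + k * s_y)

y_c, s_y = 10.46, 1.86   # predicted death rate and standard error from the text
for k in (1, 2, 3):
    low, high = prediction_band(y_c, s_y, k)
    inside = coverage(k)
    print(f"±{k} S_y: {low:.2f} to {high:.2f}, "
          f"inside {inside:.4f}, odds {inside / (1 - inside):.0f} to 1")
```

For k = 3 the computed odds come out near 369 to 1, in line with the figure quoted above.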

The same principle applies to a variety of related questions, 
e.g., What is the probability that a county with a birth rate of 
17 will have a death rate as low as 8 or as high as 12? Substituting 
X = 17 in regression equation (68), we find Y_c = 10, 
approximately. The difference between the expected death 
rate of 10 and a death rate of 8 or 12 is ±2. If we regard the 
death rates of all counties whose birth rate is 17 as normally 
distributed about a mean of 10, with a standard deviation of 
S_y = 1.86, then the difference ±2 lies 2.00/1.86 = 1.08 standard 
deviation units above or below the mean. Referring to a table 
of normal areas (Appendix Table 1), we see that practically 36 
per cent of the area of the curve falls between the mean and an 
ordinate at 1.08σ. Hence we may say that a deviation as great 
as or greater than 1.08σ may occur above or below the mean 
100 - 2(36) = 28 times in 100. The odds are therefore 72 to 
28, or roughly 2.5 to 1, against such an event. 
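
The same calculation can be sketched in Python without a printed table of normal areas. The function name `two_tailed_probability` is an assumption made for the illustration; it uses the complementary error function from the standard library.

```python
import math

def two_tailed_probability(deviation, s_y):
    """Return (z, p): the deviation in standard-error units and the
    chance of a deviation at least that large in either direction,
    assuming a normal distribution about the regression line."""
    z = abs(deviation) / s_y
    p = math.erfc(z / math.sqrt(2))
    return z, p

z, p = two_tailed_probability(2.0, 1.86)
odds_against = (1 - p) / p
print(f"z = {z:.2f}, p = {p:.2f}, odds against = {odds_against:.1f} to 1")
```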

In equations (65) and (66), b, which is the slope of the regression 
line of Y on X, is called the regression coefficient. It is a 
useful measure, since it shows the number of Y units that the 
most probable value of Y changes for each unit change in X. 
For example, in equation (68), Y_c = 5.34 + 0.275X, the regression 
coefficient is 0.275, which means that the most probable 
value of Y increases 0.275 of a Y unit for every X unit that 
X increases. If the equation were Y_c = 5.34 - 0.275X, the 
most probable value of Y would decrease 0.275 of a unit for 
each unit that X increased. 

3. The Coefficient of Correlation : Ungrouped Data.—Although 
the table of X and Y paired values (Table 50), the scatter 
diagram (Fig. 44), the regression equation of Y on X (formula 
(65)), the regression coefficient b, and the standard error of 
estimate S, give а great deal of information about the amount 
and nature of the relationship between two variables, X and Y, 
none of them furnishes in a single figure an index of the amount 
of the relationship. This is supplied by the simple Pearsonian 
coefficient of correlation, r, which for ungrouped data may be 
found from the following formula:¹ 

$$r = \frac{\sum XY - NM_xM_y}{\sqrt{(\sum X^2 - NM_x^2)(\sum Y^2 - NM_y^2)}}. \tag{73}$$

Applying this formula to Table 50, 

$$r = \frac{3839.32 - 20(18.31)(10.38)}{\sqrt{[6843.82 - 20(18.31)^2][2234.78 - 20(10.38)^2]}} = \frac{38.16}{\sqrt{(138.7)(79.89)}},$$
$$r = .36.$$


Since r is a coefficient that can vary only from 0 to ±1, this 
is not a high value, indicating rather low relationship between 
the birth rates and death rates in the 20 sample counties of 


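¹ Alternative formulas, which are sometimes convenient, are 

$$r = \frac{N\sum XY - \sum X\sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}, \tag{74}$$

$$r^2 = \frac{a\sum Y + b\sum XY - N\left(\frac{\sum Y}{N}\right)^2}{\sum Y^2 - N\left(\frac{\sum Y}{N}\right)^2}, \tag{75}$$

$$r = \frac{\sum XY - \frac{\sum X\sum Y}{N}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{N}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{N}\right]}}, \tag{76}$$

$$r = \frac{\sum XY - NM_xM_y}{N\sigma_x\sigma_y}, \tag{77}$$

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}, \tag{78}$$

$$r = \sqrt{b_{yx}b_{xy}}, \tag{79}$$

$$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_D^2}{2\sigma_x\sigma_y}, \tag{80}$$

where D refers to the differences between the raw paired values. This is 
known as the difference formula. 

$$r = \frac{\sum xy}{N\sigma_x\sigma_y} = \frac{1}{N}\sum\frac{x}{\sigma_x}\cdot\frac{y}{\sigma_y}, \tag{81}$$

$$r = \frac{\sigma_c}{\sigma_y}, \tag{82}$$

where σ_c is the standard deviation of the Y_c's calculated from the regression 
equation. See also formula (89). 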
Wisconsin. It is about what would be expected from the scatter 
diagram (Fig. 44). 
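
As a check on the arithmetic for formula (73), the following Python sketch recomputes r from the totals and means quoted for Table 50. The function name `pearson_r_from_means` is illustrative only.

```python
import math

def pearson_r_from_means(n, mean_x, mean_y, sum_x2, sum_y2, sum_xy):
    """Formula (73): r from the raw sums of squares and products
    and the two means."""
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_x2 - n * mean_x ** 2) *
                            (sum_y2 - n * mean_y ** 2))
    return numerator / denominator

# Totals and means quoted in the text for the 20 counties of Table 50.
r = pearson_r_from_means(n=20, mean_x=18.31, mean_y=10.38,
                         sum_x2=6843.82, sum_y2=2234.78, sum_xy=3839.32)
print(f"r = {r:.2f}")
```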

The labor of computing a correlation coefficient from ungrouped 
data can sometimes be reduced by dividing one or both series by 
some appropriate divisor, or by subtracting an arbitrary constant 
from the values of either or both series. As will be seen, this 
does not affect the value of r. The method also applies to the 
regression equation, provided the original values are restored. 
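
The effect of such coding can be illustrated with a small Python sketch. The paired values below are invented purely for the demonstration and are not taken from Table 50; the constants subtracted and the divisors are likewise arbitrary (the divisors must be positive). The coefficient r is unchanged by the coding, while the regression coefficient must be multiplied by the ratio of the Y divisor to the X divisor to restore the original units.

```python
import math

def deviations(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pearson_r(xs, ys):
    dx, dy = deviations(xs), deviations(ys)
    sxy = sum(p * q for p, q in zip(dx, dy))
    return sxy / math.sqrt(sum(p * p for p in dx) * sum(q * q for q in dy))

def slope(xs, ys):
    dx, dy = deviations(xs), deviations(ys)
    return sum(p * q for p, q in zip(dx, dy)) / sum(p * p for p in dx)

# Hypothetical paired values, invented only for this illustration.
xs = [12, 15, 17, 20, 22, 25]
ys = [8.1, 9.4, 8.8, 10.9, 10.2, 12.0]

# Code each series: subtract an arbitrary constant, divide by a divisor.
x_const, x_div = 15, 5
y_const, y_div = 10, 2
xs_coded = [(x - x_const) / x_div for x in xs]
ys_coded = [(y - y_const) / y_div for y in ys]

r_raw, r_coded = pearson_r(xs, ys), pearson_r(xs_coded, ys_coded)
b_raw = slope(xs, ys)
b_restored = slope(xs_coded, ys_coded) * y_div / x_div
print(f"r raw {r_raw:.4f}, r coded {r_coded:.4f}")
print(f"b raw {b_raw:.4f}, b restored {b_restored:.4f}")
```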

4. Size of Sample from Which r Is Calculated.—It is assumed 
throughout the discussion of this chapter that the coefficient of 
correlation, r, is not calculated from very small numbers of 
paired values, say fewer than 25. If this assumption is not met, 
and the data are regarded as a sample, many of the formulas 
given need correction. Since small-sampling theory is omitted 
from this text, the student may see certain references listed at 
the end of this chapter for its treatment.¹ 

5. The Meaning of the Correlation Coefficient, r.—It has already 
been seen that the standard error of estimate, S_y, around the 
regression line for Table 50 is approximately 1.86. The variance 
of the observed Y's is 


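$$\sigma_y^2 = \frac{\sum Y^2}{N} - \left(\frac{\sum Y}{N}\right)^2; \tag{83}$$

$$\sigma_y^2 = \frac{2234.78}{20} - \left(\frac{207.6}{20}\right)^2,$$
$$\sigma_y^2 = 4.$$

If we compare S_y² with σ_y², we shall have a measure known as 
the coefficient of alienation squared, k²: 

$$k^2 = \frac{S_y^2}{\sigma_y^2} = \frac{(1.86)^2}{4} = 0.865.^{2} \tag{84}$$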


This shows that 86.5 per cent of the variance in county death 
rates remains in the form of “scatter” around the regression 


¹ See, for example, Yule and Kendall, Ezekiel, Fisher, and Croxton and 
Cowden. The student should not be misled by the circumstance that, in the 
example of Table 50, 20 pairs of values were treated as a large sample. This 
was done only for convenience of illustration. Strictly, small-sample 
methods should be used with 20 cases, although even for that size of sample 
it often makes no important difference. 

2 Compare the distances of the dots from the regression line and from the 
mean of the Y's in Fig. 44. 


line, which is not controlled by the birth rates. Again, by 
formula (78), 

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2},$$
$$r^2 = 1 - k^2,$$
$$r^2 + k^2 = 1. \tag{85}$$


That is, r² and k² together account for 100 per cent of the variance 
in Y. Since we have just seen that k² indicates the percentage 
not controlled by X, r² = 1 - k² evidently indicates the percentage 
controlled by X through the medium of the regression 
equation. Thus, above, r² = (.36)² = .13, meaning that a correlation 
of r = .36 accounts for only 13 per cent of the variance 
of the Y series. This interpretation of r² is further clarified by 
formula (82) squared, 

$$r^2 = \frac{\sigma_c^2}{\sigma_y^2},$$

where the numerator, σ_c², is the variance of the Y_c series calculated 
from the regression equation, so that its value is entirely 
controlled by X. 

Substituting the values of r² and k² found in the illustrative 
problem above in formula (85), we get 

$$(.36)^2 + .865 = .995,$$

or 99.5 per cent, the slight variation from 100 per cent being due 
to approximations in the calculation of r² and k². 

Notice, in general, that an r as large as .71 is required to cut 
the variance of Y by 50 per cent (if r² = .50, then 
r = √.50 = .71). 

Where both X and Y are assumed to be built up of simple elements 
of equal variability all of which are present in Y but some of which 
are lacking in X, it can be proved mathematically that r² measures that 
proportion of all the elements in Y which are also present in X. For 
that reason, in cases where the dependent variable is known to be 
causally related to the independent variable, r² may be called the 
coefficient of determination.¹ 

¹ Mordecai Ezekiel, Methods of Correlation Analysis, p. 120, John 
Wiley & Sons, Inc., New York, 1930. 



Although these assumptions seldom hold in practice, it is 
customary to regard r² as a better measure of relationship than r. 
At any rate, r² is a more conservative estimate. 
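
The split of the variance of Y into a controlled part, r², and an uncontrolled part, k², can be verified with a short Python sketch using the totals already quoted for Table 50 and Table 51. Because it works from unrounded figures, k² comes out near .869 rather than the .865 obtained in the text from the rounded values 1.86 and 4, and r² + k² is then exactly 1. The variable names are illustrative.

```python
import math

N = 20
sum_y, sum_y2 = 207.6, 2234.78     # totals for the observed death rates
sum_d2 = 69.39083                  # sum of squared deviations (Table 51)

var_y = sum_y2 / N - (sum_y / N) ** 2   # formula (83): variance of Y
s_y2 = sum_d2 / N                       # squared standard error of estimate
k2 = s_y2 / var_y                       # formula (84): share left as scatter
r2 = 1 - k2                             # formula (85): share controlled by X

print(f"variance of Y = {var_y:.4f}")
print(f"k² = {k2:.3f}, r² = {r2:.3f}, r = {math.sqrt(r2):.2f}")
```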

Does the correlation between the birth rates and death rates in 
Table 50 mean that the birth rate is the cause of the death rate? 
Obviously, being born is not the cause of dying. Sanitary 
conditions, medical service, and various other factors determine 
death rates. It happens, however, that infants are more sus- 
ceptible to death by disease than are older children and adults, 
so for this reason, other things being equal, the population with 
the largest proportion of infants will have the highest death 
rate. In general, it may be said that the presence of simple 
correlation between two factors may or may not be accompanied 
by a direct or efficient causal connection between them. Often 
simple correlation is due to common causes, as when teachers’ 
salaries and the amount of money spent for alcoholic beverages 
rise and fall together with changes in business conditions. There 
is much danger that this kind of correlation will be misinter- 
preted. Sometimes, as in the case of the birth and death rates 
above, one factor is a necessary antecedent but not a direct 
cause of a correlated factor. Very rarely, two factors show a 
high but purely accidental correlation, as the yield of potatoes 
in Great Britain with, say, smallpox epidemics in the United 
States. The safest interpretation is that the presence of corre- 
lation between two factors indicates that as one increases the 
other tends to increase or decrease, i.e., they vary together to some 
extent. Why they vary together may be determined by further 
statistical and experimental methods, such as those of partial 
correlation and the laboratory, which seek to control the various 
interfering factors involved. 

Caution should be used in comparing two or more values of r. 
It often happens that interfering factors, of which the investigator 
takes no account, cause two r's that should be the same to differ 
widely, or two r's that should differ widely to appear the same. 
Unless “other things are equal," at least broadly, such compari- 
sons have little point. 

6. A Convenient Formula for the Regression Equation When 
r Is Known.—When the value of r is found before the regression 
equation is set up, the latter may conveniently be obtained from 
the equation 

$$Y_c - M_y = r\frac{\sigma_y}{\sigma_x}(X - M_x), \tag{86}$$

or 

$$y_c = r\frac{\sigma_y}{\sigma_x}x. \tag{87}$$

Comparing formula (87) with formula (69), it is seen that 

$$b = r\frac{\sigma_y}{\sigma_x},$$

or 

$$r = b\frac{\sigma_x}{\sigma_y}. \tag{88}$$
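
Applied to the county data, formulas (86) and (88) should reproduce the regression equation already found. The Python sketch below works from the totals and means quoted for Table 50; the function name `regression_from_r` is hypothetical, and r is obtained here by formula (77). Small differences from the printed constants in the later decimal places reflect the rounded means.

```python
import math

N = 20
mean_x, mean_y = 18.31, 10.38
sum_x2, sum_y2, sum_xy = 6843.82, 2234.78, 3839.32

sigma_x = math.sqrt(sum_x2 / N - mean_x ** 2)
sigma_y = math.sqrt(sum_y2 / N - mean_y ** 2)
r = (sum_xy / N - mean_x * mean_y) / (sigma_x * sigma_y)   # formula (77)

def regression_from_r(r, sigma_x, sigma_y, mean_x, mean_y):
    """Return (a, b) of Y_c = a + bX, with b = r * sigma_y / sigma_x
    and a chosen so that the line passes through the two means."""
    b = r * sigma_y / sigma_x
    a = mean_y - b * mean_x
    return a, b

a, b = regression_from_r(r, sigma_x, sigma_y, mean_x, mean_y)
print(f"Y_c = {a:.2f} + {b:.3f}X")
```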


7. Simple Linear Correlation Applied to Grouped Data.—The 
method of dealing with simple linear correlation developed above 
applies to ungrouped data, such as shown in Table 50. In the 
case of grouped data, the principles and procedures are the 
same, except that formulas (89) through (92) are specially 
adapted for use with frequency tables. 

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}, \tag{89}$$

where 

$$\sum xy = \sum f d_x d_y - \frac{(\sum f_x d_x)(\sum f_y d_y)}{N}, \tag{90}$$

$$\sum x^2 = \sum f_x d_x^2 - \frac{(\sum f_x d_x)^2}{N}, \tag{91}$$

$$\sum y^2 = \sum f_y d_y^2 - \frac{(\sum f_y d_y)^2}{N}; \tag{92}$$

x and y are mean deviates, d_x represents unit step deviations 
from an assumed mean of the X's, d_y represents unit step deviations 
from an assumed mean of the Y's, N is the total frequency 
of pairs in the table, f_x is the total frequency of pairs in an X class 
or column, f_y is the frequency of pairs in a Y class or row, and 
f is the frequency of pairs in a cell. These symbols appear in 
the margins of correlation Table 53. 

It is reasonable that the proportion of children in a state's 
population should influence the percentage of the state's income 
that is spent for schooling. Let us measure the extent to which 
this is true. The data needed are in Table 52. For our purpose 
it is not necessary to weight the percentage figures by the state 
populations. 
TABLE 52.—PERCENTAGE OF POPULATION UNDER 19 YEARS OF AGE IN 1930, 
AND PERCENTAGE THAT SCHOOL EXPENDITURES WERE OF ALL 
INCOME IN 1928, BY STATES* 

                   Per cent of popula-    Per cent that school 
                   tion under 19 years    expenditures were of 
State              of age, 1930 (X)       all income, 1928 (Y) 

Southeast: 
  Virginia            44.4                  2.61 
  N. Carolina         49.3                  4.38 
  S. Carolina         50.6                  3.16 
  Georgia             46.3                  1.75 
  Florida             39.2                  5.76 
  Kentucky            43.9                  2.29 
  Tennessee           43.8                  2.57 
  Alabama             47.0                  2.74 
  Mississippi         [illegible]           3.94 
  Arkansas            [illegible]           2.55 
  Louisiana           44                    2.61 
Southwest: 
  Oklahoma            44                    3.27 
  Texas               42                    2.57 
  N. Mexico           46                    3.40 
  Arizona             42                    [illegible] 
Northeast: 
  Maine               37                    1.93 
  N. Hampshire        35                    2.14 
  Vermont             37                    2.24 
  Massachusetts       35                    1.85 
  R. Island           37                    1.89 
  Connecticut         37                    2.46 
  New York            33                    2.11 
  N. Jersey           36                    3.20 
  Delaware            35                    1.91 
  Pennsylvania        39                    2.20 
37 1.97 
з. 
3. 
3. 
2 


34 28 

2 
38 5 
38 5 
37 82 
35 46 
45 13 
42 78 


i 42 02 
Wyoming 39 

Colorado. 38 29 
Utah. 16 91 


California. 
United States 


* From T. J. Woofter, Jr., Landlord and Tenant on the Cotton Plantation, WPA 
Research Monograph V, 1936, p. 141. 



The 48 pairs of values in Table 52 are hardly enough to justify 
grouping, but are convenient for illustrating the grouped method. 
The entries in Table 53 are made from the ungrouped data of 
Table 52, as follows. X represents the percentage of the population 
under 19 years of age, and Y is the percentage that expenditures 
for school purposes were of total income in 1928. The 
first state in Table 52 has X = 44.4, so it will fall somewhere 
in col. 44.0-45.9 of Table 53. Since the corresponding Y value 
is 2.61, a tally is entered in row 2.40-2.79 of col. 44.0-45.9. 
Similarly, the second state has an X value of 49.3 and a Y value 
of 4.38, so a tally is placed in col. 48.0-49.9 and row 4.00-4.39 
of Table 53; and so on. After all the entries are tallied in the 
cells, the tallies are counted and replaced by numbers. 
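
The tallying procedure can be sketched in Python as follows. The sketch assumes, from the class headings quoted in the text, that X classes are 2.0 wide and start at even values (34.0-35.9, 44.0-45.9, and so on) and that Y classes are 0.40 wide and start at multiples of 0.40 (2.40-2.79, 4.00-4.39). Only the two pairs discussed above are entered; the helper names are hypothetical. Values are converted to whole tenths or hundredths before classing so that boundary cases are not disturbed by floating-point error.

```python
from collections import Counter

def class_lower_limit(value, scale, width_units):
    """Lower limit of the class containing value.

    scale converts the value to whole units (10 for tenths, 100 for
    hundredths); width_units is the class width in those units.
    """
    units = round(value * scale)
    return (units // width_units) * width_units / scale

def cell(x, y):
    """(column, row) lower limits for one pair of values."""
    return (class_lower_limit(x, 10, 20),     # X classes 2.0 wide
            class_lower_limit(y, 100, 40))    # Y classes 0.40 wide

# The first two pairs discussed in the text.
pairs = [(44.4, 2.61), (49.3, 4.38)]
table = Counter(cell(x, y) for x, y in pairs)

for (col, row), count in sorted(table.items()):
    print(f"col. {col:.1f}-{col + 1.9:.1f}, "
          f"row {row:.2f}-{row + 0.39:.2f}: {count}")
```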

In Table 53 we then see two ordinary frequency distributions, 
X and Y, placed at right angles to each other and exhibiting 
a double classification. The large figures in the cells are the 
frequencies. Instead of making a scatter diagram, as we did 
with ungrouped data, let us estimate the mean of the Y's in 
each column of the table. Consider, for example, the column 
with the heading 34.0-35.9. We have for the mean 

$$\frac{(2.6 \times 1) + (2.2 \times 2) + (1.8 \times 2)}{5} = 2.12.$$

This may be marked by a small circle at the left side of the 
column, although if it did not interfere with reading the table 
it should be located at the mid-point of the column. Similar 
circles indicate the positions of the means of the other columns 
which have a frequency as large as five. An inspection of these 
means shows that they have an irregular tendency to rise in the 
positive direction across the table. This suggests some positive 
correlation between X and Y. However, the circles form more 
of a curve than a straight line, rising to a peak in the 38.0-39.9 
column and then descending slightly. If we suppose that we 
are dealing with a sample thrown up by a particular set of 
causes, some of the irregularities may be due to random factors 
and a small sample. But even if we make allowance for the 
extreme cases in cols. 38.0-39.9, 42.0-43.9, and 44.0-45.9, the 
curved effect is not lessened. To assume that the relationship 
is linear and estimate the amount of correlation on that basis 
will reduce the value of the coefficient slightly, compared with 
the use of a coefficient of curvilinear correlation. Since we 
cannot deal with curvilinear correlation here, we shall use the 
simpler straight-line hypothesis. There is also some justification 
for this in view of the fact that the large scatter indicates a low 
correlation in any case. 
The line of regression of Y on X, and dotted lines representing 
±S_y, the values for which are worked out below, are drawn in 
the correlation table (Table 53). A study of them in relation 
to the entries in the correlation table should be helpful, just as 
it was in the case of the scatter diagram for ungrouped data 
(see Fig. 44). It appears from Table 53 that the actual relationship 
changes from strongly positive in the left half of the table 
to moderately negative in the right half, whereas the linear 
regression implies a constant positive correlation throughout. 
Also, the linear equation is far from fitting the data of the two 
halves of the table equally well. On the other hand, in only one 
column does the proportion of items falling outside the range of 
one standard error of estimate around the regression line exceed 
the normal one-third. In practice it would probably not be 
worth while to carry the analysis any farther. We shall, however, 
use the table to show the steps involved in calculating the 
Pearsonian correlation coefficient, r, the linear regression of Y 
on X,¹ the standard error of estimate, S_y, and other statistics, 
from grouped data. 

Proceeding with Table 53, we enter unit-step deviations in row 
(2) and col. (2). The entries in row (3) and col. (3) and in row 
(4) and col. (4) are familiar and should be obvious from the 
symbols. Next, we multiply each cell frequency first by d_x and 
place the product in the upper right-hand corner of the cell, 
and then by d_y and place the product in the lower left-hand 
corner of the cell. The d_x products are then added by rows and 
the d_y products by columns. Column (6) and row (6) are 
obtained by multiplying the entries in col. (5) and row (5) by 
d_y and d_x, respectively, and the products are summed over the 
column and the row.² 

We finally substitute from Table 53 in formulas (90)-(92), 

$$\sum xy = 79 - \frac{(19)(25)}{48} = 69.1,$$

$$\sum x^2 = 315 - \frac{(25)^2}{48} = 302,$$

$$\sum y^2 = 317 - \frac{(19)^2}{48} = 309.5.$$


¹ There is also always a regression line of X on Y, from which the most 
probable values of X may be calculated for given values of Y. The two 
regression lines are not the same. To find the regression of X on Y, simply 
change places with X and Y in the equations given in this chapter. 

² As a check on the work, notice that in Table 53 cols. (3), (5), and (6) 
should have the same totals as rows (5), (3), and (6), respectively. 
Substituting these values in formula (89), 

$$r = \frac{69.1}{\sqrt{(302)(309.5)}} = \frac{69.1}{\sqrt{93{,}469}} = \frac{69.1}{305.7},$$
$$r = .23.$$

This value of r indicates very little relationship. Nevertheless, 
for purposes of demonstration, we shall show the use of the 
formulas for finding the regression equation of Y on X and the 
coefficient of alienation, k. We have 

$$b'_{yx} = \frac{\sum xy}{\sum x^2}, \tag{93}$$
$$b'_{yx} = \frac{69.1}{302} = 0.23.$$

But this value of b is in terms of unit-step deviations or class 
intervals. To change it back to scale units, 

$$b_{yx} = b'_{yx}\frac{i_y}{i_x}, \tag{94}$$

where i_y = class interval of Y, 
      i_x = class interval of X. 

$$b_{yx} = .23\left(\frac{.4}{2}\right) = 0.046.$$
$$a_{yx} = M_y - b_{yx}M_x. \tag{95}$$
$$a_{yx} = 3.16 - .046(40),$$
$$a_{yx} = 1.32.$$

Therefore, substituting in formula (65), we have 

$$Y_c = 1.32 + .046X;$$

also, 

$$S_y^2 = \sigma_y^2(1 - r^2), \tag{96}$$
$$\sigma_y^2 = i_y^2\left[\frac{\sum f_y d_y^2}{N} - \left(\frac{\sum f_y d_y}{N}\right)^2\right], \tag{97}$$
$$\sigma_y^2 = (.4)^2\left[\frac{317}{48} - \left(\frac{19}{48}\right)^2\right],$$
$$\sigma_y^2 = 1.03,$$
$$S_y^2 = 1.03(1 - .0529),$$
$$S_y^2 = .9755,$$
$$k^2 = \frac{S_y^2}{\sigma_y^2} = \frac{.9755}{1.03} = .95.$$

Thus an r of .23 leaves 95 per cent of σ_y² as scatter around the 
regression equation, or improves prediction only 

$$r^2 = 1 - k^2 = .05,$$

or about 5 per cent, in terms of the variance of the Y's. 
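
The whole grouped-data calculation can be retraced with a short Python sketch that starts from the marginal totals of Table 53 as they are quoted in the substitutions above (79, 25, 19, 315, 317, N = 48), together with the class intervals and means used in the text. The variable names are illustrative. Because the sketch carries unrounded intermediate values, the constant a comes out at about 1.33 rather than 1.32.

```python
import math

# Totals read from the substitutions in the text (Table 53).
N = 48
sum_f_dx_dy = 79      # sum of f * d_x * d_y over all cells
sum_fx_dx = 25        # sum of f_x * d_x
sum_fy_dy = 19        # sum of f_y * d_y
sum_fx_dx2 = 315      # sum of f_x * d_x squared
sum_fy_dy2 = 317      # sum of f_y * d_y squared
i_x, i_y = 2.0, 0.4   # class intervals of X and Y
mean_x, mean_y = 40.0, 3.16

sum_xy = sum_f_dx_dy - sum_fx_dx * sum_fy_dy / N     # formula (90)
sum_x2 = sum_fx_dx2 - sum_fx_dx ** 2 / N             # formula (91)
sum_y2 = sum_fy_dy2 - sum_fy_dy ** 2 / N             # formula (92)

r = sum_xy / math.sqrt(sum_x2 * sum_y2)              # formula (89)
b_step = sum_xy / sum_x2                             # formula (93)
b = b_step * i_y / i_x                               # formula (94)
a = mean_y - b * mean_x                              # formula (95)
var_y = i_y ** 2 * (sum_fy_dy2 / N - (sum_fy_dy / N) ** 2)   # formula (97)
s_y2 = var_y * (1 - r ** 2)                          # formula (96)
k2 = s_y2 / var_y

print(f"r = {r:.2f}, b = {b:.3f}, a = {a:.2f}")
print(f"variance of Y = {var_y:.2f}, S_y² = {s_y2:.2f}, k² = {k2:.2f}")
```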

The student is asked to check the plotting of the regression 
line and the lines showing the standard error of estimate in 
Table 53. 

Regression equations (69), (86), and (87) also apply to grouped 
data. 

From the above, it is clear that there is little tendency for the 
percentage of income expended for schools to be proportionate 
to the percentage of children under 19 years old in the population 
when states are taken as units and a linear relationship is assumed. 
Apart from the latter assumption, which has already been dis- 
cussed, it may well be objected that a state is a large area, 
within which very different relations between these two per- 
centages may exist. Thus a large city and a rural county in 
the same state may be more sharply unlike in this respect than 
two cities in separate states. For this reason, the average 
relationship given for each state as a whole is likely to be unrepresentative, 
and so to lack meaning. It would be much better 
if the data were available by school districts, in which case a 
higher correlation might be found. 

8. The Rank Correlation.—A method of linear correlation that 
takes account of the rank orders of paired items but disregards 
their values is sometimes used for rough work, or when the 
values of the items are not known. The formula is¹ 

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}, \tag{98}$$

where D is the difference between the ranks of a pair of items, 
and N is the number of pairs. 

As an illustration of the use of this formula, let us refer back 
to Table 50, and rank the counties with respect to their death 
rates and birth rates, as shown in Table 54. When there are 
ties, as between Douglas and Eau Claire counties in death rates, 
and Clark and Fond du Lac in birth rates, the tied items are 
given the mean of the ranks they would occupy if they were not 
equal, and the next item takes the rank just above the highest 


¹ ρ is the lower-case Greek letter rho. 
rank used in finding the tied mean. For example, the ranks 7 
and 8 are averaged to give (7 + 8)/2 = 7.5 as the mean rank of 
Clark and Fond du Lac counties, and Columbia county has the 
rank 9. 

TABLE 54.—TWENTY WISCONSIN COUNTIES RANKED WITH RESPECT TO 
BIRTH RATES AND DEATH RATES (LOW TO HIGH) 

                  Rank in       Rank in 
County            birth rate    death rate      D         D² 
[illegible]          1             4           -3          9 
[illegible]          2             5           -3          9 
Calumet              3             1            2          4 
Douglas              4            17.5        -13.5      182.25 
Dane                 5            19          -14        196 
Burnett              6            10           -4         16 
Clark                7.5           3            4.5       20.25 
Fond du Lac          7.5          13           -5.5       30.25 
Columbia             9            20          -11        121 
[illegible]         10             2            8         64 
[illegible]         11            12           -1          1 
Barron              12            11            1          1 
[illegible]         13             7            6         36 
Dunn                14             6            8         64 
Chippewa            15            16           -1          1 
Door                16             8            8         64 
Eau Claire          17            17.5         - .5         .25 
[illegible]         18            14            4         16 
Ashland             19            15            4         16 
Crawford            20             9           11        121 
Total                                                    972 


Substituting in formula (98), 

$$\rho = 1 - \frac{6(972)}{20(400 - 1)} = .27.$$

Like r, the value of ρ may vary from +1.0 to -1.0. 
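
The rank method, including the treatment of ties, can be sketched in Python. The two functions are illustrative helpers, not part of the text. The rank pairs are those of Table 54, entered in table order; the death-rate rank of the eighth county is taken as 13, the value implied by its D of -5.5.

```python
def average_ranks(values):
    """Rank values from low to high, giving tied values the mean of
    the ranks they would occupy if they were not equal."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_correlation(ranks_x, ranks_y):
    """Formula (98): rho = 1 - 6 * sum(D^2) / (N (N^2 - 1))."""
    n = len(ranks_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rank pairs from Table 54 (birth-rate rank, death-rate rank).
birth = [1, 2, 3, 4, 5, 6, 7.5, 7.5, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
death = [4, 5, 1, 17.5, 19, 10, 3, 13, 20, 2,
         12, 11, 7, 6, 16, 8, 17.5, 14, 15, 9]

rho = rank_correlation(birth, death)
print(f"rho = {rho:.2f}")
print(average_ranks([10, 20, 20, 30]))   # a small tie example
```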


Exercises 


1. a. What is the amount of relationship between the length of French 
and English words in the accompanying table? Plot the data, and discuss 
the scatter diagram. Is the relationship reasonably linear? Use 
both the ungrouped and the grouped methods of calculating r as a 
check. Do the two methods necessarily give exactly the same value of 
r? Explain. Just what does r mean in this case? 


NUMBER OF LETTERS IN A SAMPLE OF FRENCH WORDS (X), AND IN THEIR 
NEAREST ENGLISH EQUIVALENTS (Y) 


X Y X Y X Y 
1 2 4 5 8 8 
9 9 6 6 9 10 
8 8 6 4 5 5 
6 7 8 9 9 9 
4 7 5 10 4 5 
i vi 4 4 5 4 
7 7 0 7 3 3 
6 6 5 6 6 5 
8 11 8 9 12 11 
8 8 5 6 5 5 
8 11 5 4 8 9 
8 7 3 3 11 11 
8 7 4 3 8 8 
9 8 7 5 7 6 
7 5 7 8 И 8 
5 5 6 7 7 7 
6 5 9 7 7 7 
12 9 8 6 6 5 
5 8 5 4 0 9° 
8 7 6 2 9 9 
10 10 5 4 9 8 
7 8 8 8 8 9 
5 8 6 7 9 11 
8 9 9 9 8 7 
7 8 9 8 5 6 
10 11 7 7 
8 8 7 8 
3 3 8 5 
7 7 5 6 


b. Get the regression of Y on X from both the ungrouped and the 
grouped data, as a check, plot the line, and explain what a and b mean 
in the equation. 

c. What is the most probable length of an English word corresponding 
to a French word of six letters? 

d. Within what range will the number of letters in the English words 
in (a) fall two times out of three? Ninety-five times out of 100? 
Practically always? 

e. What is the value of the coefficient of alienation squared, and 
what does it mean here? 

f. What is the coefficient of determination and its interpretation 
in this problem? 

g. Find the coefficient of rank correlation, ρ, for the same data, and 
compare its value, meaning, and adequacy with r. 


2. For the table below, find the value of r and of b, and compare 
them in meaning. 


AGE OF FATHERS (Y) CORRELATED WITH AGE OF SONS (X) 


3. Find r² and k² for the following table, and explain their meaning. 


NUMBER OF CHILDREN IN THE FIRST GENERATION OF SIX FAMILIES (X) 
AND THE AVERAGE NUMBER OF CHILDREN IN THE SECOND GENERATION 
OF THE SAME FAMILIES (Y) 

X    3    4    6    7    9    15 
Y    3    2    4    4    5    [illegible] 


4. a. By inspection, is there any relationship between the votes of 
the states in 1876 and in 1932? If any, is it positive or negative? 


REPUBLICAN VOTE FOR PRESIDENT IN NINE STATES, 1876 AND 1932 

                    Per cent of vote Republican 
State                  1876        1932 

Massachusetts           58          48 
New York                48          41 
[illegible]             55          32 
[illegible]             41          35 
Virginia                41          30 
Mississippi             21           4 
Louisiana               48           7 
Nevada                  53          31 
California              51          39 
b. What does the scatter diagram show? 

c. What is the equation of the regression of Y on X, where X is the 
percentage of the vote Republican in 1876, and Y is the percentage of 
the vote Republican in 1932? Plot the line in the scatter diagram. 

d. What is the standard error of estimate? Plot it in the scatter 
diagram. 

e. What is the most probable percentage of the vote Republican in 
1932 of a state that voted 55 per cent Republican in 1876? 

f. Assuming a normal distribution about the regression line, within 
what limits of error will the percentage vote fall two out of three 
times? 20 out of 21 times? Within what limits of error does it 
actually fall in each case? 

5. a. What does the scatter diagram show in the case of the accom- 
panying table of death rates in Connecticut and Massachusetts? 


DEATH RATE IN CONNECTICUT AND MASSACHUSETTS* 

                     1924     1928 
Connecticut          11.3     12.0 
Massachusetts        12.0     13.0 

* From B. H. Camp, The Mathematical Part of Elementary Statistics, p. 144. 
D. C. Heath & Company, Boston, 1935. 

b. If the death rate for Massachusetts is 12 in 1924, what is the most 
probable death rate for Connecticut in the same year in terms of the 
relationship between the two? 

c. How much of the variance still remains as scatter in predicting a 
death rate in Connecticut from one in Massachusetts? 
CHAPTER XI 


GROSS RELATIONSHIP BETWEEN TWO FACTORS: 
NONQUANTITATIVE CORRELATION 


1. Qualitative Data.—The method of correlation described up 
to this point has dealt with quantitative series only, e.g., birth 
and death rates, and proportion of state income spent for educa- 
tion. It often happens in sociological investigations, however, 
that it is necessary to know the amount of relationship between 
two factors, one or both of which are qualitative. Examples of 
qualitative factors are rural or urban residence; personality 
ratings like Annoying, Unsympathetic, Sympathetic; occupa- 
tional classes—Professional, Proprietor, Clerical, Skilled, 
Unskilled; and so on. Methods for correlating data of this 
type have been devised. Before using them, effort should be 
made to convert the qualitative attributes into quantitative 
variables, because the latter are usually more accurate and 
reliable. Thus, a student might be classified by the number of 
credits earned in college, rather than as Sophomore or Junior. 

2. Reliability of Classification.—Since much depends on the 
reliability with which the nonquantitative variables are classified, 
it is advisable to have the classification repeated by two or more 
qualified persons. If the results are very different, better 
criteria for classification should be developed, or the problem 
dropped. 

This point may be illustrated. The questionnaire that the 
members of a class in statistics filled out regarding their previous 
training in mathematics called for the sex of each student. If 
it were desired to correlate success in mathematics with sex, 
the members of the class might be divided by sex, and then 
subdivided into, say, four groups according to the average 
grades received in mathematics. This would give a table like 
Table 55. 

The question of the reliability of the classification by class 
standing in this table can be dismissed, because it is based on a 
quantitative variable, the average grades received in mathematics. 
The sex classification might be somewhat unreliable 
if it depended merely on the Christian names of the students 
in the questionnaires; but reference to the questionnaire used 
shows that the students checked the words Male and Female. 
The reliability of this classification can therefore also be accepted 
with confidence. We may then proceed to find the amount of 
relationship between the two factors in the table. 


TABLE 55.—STUDENTS IN A STATISTICS CLASS GROUPED BY SEX AND GRADES 
RECEIVED IN MATHEMATICS 

            Students by class standing 
Sex         1     2     3     4     Total 

All classifications are not so simple as those in Table 55, how- 
ever. In Table 56, for example, a second competent person 


› With respect to the economic status of the family. 


TABLE 56.—ECONOMIC STATUS OF THE FAMILY IN WHICH PAROLEE WAS 
REARED AND OUTCOME ON PAROLE 

                                  Parole violators
Status           Parolees       Number     Per cent
[illegible]         287            44        15.3
Moderate            261            26        10.0
Comfortable          59             6        10.2
[illegible]          22             3          *
Total               629            79

* Sample too small to warrant an estimate. 


3. Choice of a Method.—After the reliability of the classifica-
tions in a nonquantitative correlation table has been established, 
the question of how to calculate the amount of relationship 
between the two factors in the table arises. The answer depends 
on the nature of the particular factors to be correlated. It is 
convenient to set up a key, as in Table 57, which will suggest 
what method should be used in each case. 

The terms in Table 57 need definition and illustration. Quan- 
titative means expressed in countable units, as crime rates or 
heights of male freshmen. Qualitative refers to nonmeasured 
traits, like those mentioned in the first paragraph of this chapter. 
Qualitative Ordered refers to qualitative categories that can be 
arranged in ascending or descending order, as Favorable, Indif- 
ferent, Hostile. Qualitative Unordered applies to qualitative 
categories that cannot be arranged in ascending or descending 
order, e.g., Law, Medicine, Engineering. A dichotomous series 
is a series of two mutually exclusive and exhaustive categories, 
as Good, Not Good; Sick, Not Sick; Male, Female; College 
Graduates, Others; Families with Less than Four Children, 
Families with Four or More Children. 


TABLE 57.—KEY TO SELECTED METHODS OF NONQUANTITATIVE 
CORRELATION 

Variable A                       Variable B                     Method
Quantitative: several classes    Dichotomous                    Biserial, r_bis
Quantitative or qualitative:     Qualitative: ordered or        Contingency, C
  ordered or unordered;            unordered; several classes
  several classes or dichotomous
Dichotomous                      Dichotomous                    Tetrachoric, r_t
                                                                Yule's Q
                                                                Fourfold r_4

It is not feasible to deal here with more than the five methods 
listed in Table 57, though a number of less prominent methods 
are omitted. 

4. Biserial Correlation.—In a study of divorce data for the 
United States in 1929, it is desirable to know whether there is 
any correlation between the party to whom the divorce was 
granted and the number of children affected. The data are 
shown in the first four columns of Table 58. We have here a 
quantitative series to be correlated with a dichotomous series. 
According to the key in Table 57, this requires the biserial 
method of correlation. 

The biserial method that we shall employ assumes that the 
dichotomous trait is normally distributed and continuous (i.e., 
there is no gap in the series, and no disarrangement of an ordered 
series). The relationship must be linear, or that of a straight 
line. In the present case the idea of normality at first seems to 
have little meaning. However, if we think of the possibility of 
measuring the extent to which the husband or the wife is respon-
sible for the granting of the divorce, and if it is reasonable to 
suppose that one party will seldom be wholly the instigator, 
but that in most cases both will be about equally involved, we 
may perhaps assume that the distribution of the dichotomous 
factor is fairly normal. 

Since all reported divorces are included, the series is con-
tinuous. As a rough test of whether or not the relationship 
is linear, the scatter diagram shown in Fig. 47 is used. In this 
figure are plotted the percentages of the divorces granted to hus-
bands by the number of children affected. The trend, if any, is 
very irregular. Where there are no children, a much larger per-
centage of divorces is granted to the husband than where there are 
children. When the number of children is very large, i.e., 
eight or nine, the proportion of divorces granted to the husband 
falls to a minimum. When the number of children affected 
ranges from one to seven, the proportion of divorces granted to 
the husband remains practically stationary. The low percent-
ages of divorces granted to the husband when there are eight or 
nine children may be unreliable because of the small number of 
cases involved; but the circumstance that the percentage is low 
both for eight children and for nine children tends to support 
the observed figures. There seems to be little reason for calcu-
lating the value of r_bis in this case. We shall do so merely to 
show the method. 

Fig. 47.—Percentage of divorces granted to husband and num-
ber of children involved. 

TABLE 58.—DIVORCES GRANTED, CLASSIFIED ACCORDING TO NUMBER OF 
CHILDREN AFFECTED: 1929* 

                       Divorces granted
Children                                               Per cent to
affected     To husband     To wife      Total (f)     husband†
  (1)           (2)           (3)           (4)           (5)
   0          36,840        76,970       113,810        .3237
   1           8,385        32,223        40,608        .2065
   2           4,255        15,242        19,497        .2182
   3           1,841         6,161         8,002        .2301
   4             774         2,571         3,345        .2314
   5             352         1,191         1,543        .2281
   6             155           518           673        .2303
   7              68           245           313        .2173
   8              22           108           130        .1692
   9              16            77            93        .1720
Total         52,708       135,306       188,014        .2803
Mean            0.55          0.77          0.71

* From Marriage and Divorce, 1929, p. 41, U. S. Bureau of the Census. "Nine or more 
children" taken as nine, and "no report as to children" disregarded. 


Apparently, the relationship in Table 58 is not linear. We 
shall work out the correlation, however, on the assumption that 
it is linear. The difference is unimportant here. 

The formula for finding biserial r is 

$$r_{bis} = \frac{M_2 - m_1}{\sigma}\cdot\frac{pq}{y}, \tag{99}$$

where m_1 is the mean of the smaller frequency distribution [cols. 
(1) and (2) of Table 58], M_2 is the mean of the larger frequency 
distribution [cols. (1) and (3)], σ is the standard deviation of the 
total frequency distribution [cols. (1) and (4)], p is the propor-
tion that the total frequency of the larger distribution [col. (3)] 
is of the grand total frequency [col. (4)], q = 1 − p, and y is the 
height of the ordinate of a normal curve of unit area and unit 
standard deviation at the point separating the area of the curve 
into the proportions p and q, as found from Appendix Table 1. 
The means and standard deviation required are calculated by 
the usual method of unit-step deviations from an assumed mean. 


The required values are 

m_1 = 0.55, 
M_2 = 0.77, 
σ = 1.14, 
p = 135,306/188,014 = .72, 
q = 1.00 − .72 = .28. 

To find y, we turn to Appendix Table 1. In Fig. 48 a normal 
curve is shown. As explained elsewhere, the values given in the 
body of the table represent the proportion of the area of the 
curve included between the mean ordinate (shown at zero in the 
figure) and ordinates erected at various distances, measured in 
standard deviation units, from the mean. Since here p = .72, 
we need to find the height of the ordinate which divides the 
curve so that .72 of its area falls to the left and .28 to the right. 
Evidently, .72 of the area will occupy the whole left half of the 
curve, and a proportion .72 − .50 = .22 will extend into the 
right half. Looking for .22 in the column of the table headed 
"Area," we find as the nearest approximation to it the figure 
0.2190, and note that the corresponding figure in the column 
headed "Ordinate (y)" is 0.3372. We therefore have y = 0.3372, 
and are ready to substitute in formula (99): 

$$r_{bis} = \frac{0.77 - 0.55}{1.14}\cdot\frac{(.72)(.28)}{.3372} = .12.$$

Fig. 48.—Normal curve used to find value of y in formula (99). 


As would be expected from our preliminary analysis of Table 58 
and Fig. 47, the amount of linear relationship between divorces 
granted to husbands and the number of children affected is very 
slight. 

The sign of r_bis indicates the direction of the relationship 
between the quantitative factor and the proportion of cases in 
the distribution represented by p in formula (99). Here there 
is a slight positive association between number of children and 
divorces granted to the wife, or a slight negative association 
between number of children and divorces granted to the husband. 
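
The arithmetic of formula (99) can be checked mechanically. The following Python sketch is an illustration only and is not part of the original computation; the function name `biserial_r` and its argument names are hypothetical. It takes the rounded values quoted in the text and finds the ordinate y directly from the normal curve, instead of reading the nearest entry in a printed table, so its y is slightly smaller than the tabled 0.3372.

```python
from statistics import NormalDist


def biserial_r(m1, m2, sigma, p):
    """Formula (99): r_bis = (M_2 - m_1) / sigma * (p * q) / y.

    m1    -- mean of the smaller distribution
    m2    -- mean of the larger distribution
    sigma -- standard deviation of the total distribution
    p     -- proportion of all cases in the larger distribution
    """
    q = 1.0 - p
    unit_normal = NormalDist()       # unit area, unit standard deviation
    z = unit_normal.inv_cdf(p)       # point dividing the area into p and q
    y = unit_normal.pdf(z)           # height of the ordinate at that point
    return (m2 - m1) / sigma * (p * q) / y


# Rounded values quoted in the text for the divorce data.
r_bis = biserial_r(m1=0.55, m2=0.77, sigma=1.14, p=0.72)
print(f"r_bis = {r_bis:.2f}")
```

With these inputs the sketch gives a coefficient of about .12. The sign is positive because the larger distribution (divorces granted to the wife) has the higher mean number of children.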

The general conclusion from this analysis is that, if any corre- 
lation is present at all, there is a very slight tendency for the 
husband to receive the divorce relatively less often as the number 
of children increases. Much more informative, however, was 
the interpretation made from the scatter diagram in Fig. 47, 
that the proportion of divorces granted to the husband (1) was 
considerably greater where there were no children at all, (2) was 
little affected by increases in the number of children from one to 
seven, and (3) was а minimum when the number of children 
was eight or more. 

Biserial correlation is a special adaptation of the method of 
correlation used in finding the Pearsonian coefficient of correla-
tion, r, for quantitative data. For this reason, r_bis may be 
regarded as the nearest approximation to r that can be found 
when a quantitative series is correlated with a dichotomous 
series. 

5. The Coefficient of Contingency.—A total of 1,118 inmates 
of a state prison were classified as murderers, sex offenders, and 
property offenders. It was desired to know how much, if any, 
correlation existed between these three criminal types and intel-
ligence. An intelligence test was given to all the men, with the 
results shown in Table 59. This table contains one quantitative 
series and one unordered qualitative classification. The key 
in Table 57 indicates the method of contingency for finding the 
amount of association present. This coefficient is based on the 
Chi-square (χ²) method, and measures the amount of deviation 
of the observed frequencies in the table from purely random or 
chance frequencies. The method of finding the chance or 
theoretical frequencies, f_t, is based on two elementary theorems 
in the mathematics of probability which have already been 
treated (see Chap. IX). Thus, the probability that any criminal 
will fall in, say, the first column of Table 59 is the ratio of the 
total number that fall in that column to the total frequency 
of the table, or n_j/N = 70/1,118, where n_j is the total frequency 
in col. (1), and N is the total frequency of the table. Likewise, 
the probability that any criminal will fall in, say, the first row 
is the ratio of the total number that fall in that row to the total 
frequency of the table, or n_i/N = 17/1,118. Now the probability 
of two independent events occurring together is the product 
of the probabilities of their separate occurrences. Therefore, 
the probability that any criminal will fall in both the first column 
and the first row of the table is 

$$\left(\frac{n_j}{N}\right)\left(\frac{n_i}{N}\right) = \frac{n_i n_j}{N^2} = \left(\frac{70}{1{,}118}\right)\left(\frac{17}{1{,}118}\right) = 0.000952.$$

This means that about one out of every 1,000 prisoners in Table 
59 may be expected by chance alone to fall in the cell common 
to col. (1) and row (1). Since there are 1,118 prisoners in the 
table, the expected frequency is 

$$\frac{n_i n_j}{N^2}(N) = 0.000952\,(1{,}118) = 1.0644.$$

This formula may evidently be shortened, however, to 

$$f_t = \frac{n_i n_j}{N}, \tag{100}$$

giving for the above f_t = 17(70)/1,118 = 1.0644, again. We now 
write this expected frequency in row (1) and col. (2) of Table 59. 
By use of formula (100), all of the expected frequencies are 
calculated and entered in cols. (2), (7), and (12). This compu-
tation is more easily done for any column by setting n_j/N in 
the calculating machine and multiplying it successively by the 
total row frequencies, n_i. It is a general principle of the χ² test 
that no cell should contain much less than five expected fre-
quencies. Any cell that offends in this respect should be com-
bined with the cell above or below it. For this reason, in Table 
59 the frequencies of the first row and of the last two rows are 
combined with those just below or above. Comparing now the 
observed with the theoretical frequencies, we notice a consider-
able amount of difference. This indicates some association 
between the criminal classifications and intelligence. We 
proceed to measure it by computing χ²: 

$$\chi^2 = \sum \frac{(f_o - f_t)^2}{f_t}. \tag{101}$$

$$\chi^2 = 19.087 + 25.673 + 7.230 = 51.990.$$

Substituting in the formula for the coefficient of contingency, C, 

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}, \tag{102}$$

$$C = \sqrt{\frac{51.990}{1{,}118 + 51.990}} = \sqrt{.0445} = .21.$$

The amount of association between the types of criminals and 
intelligence is seen to be low. If we regard our 1,118 prisoners 
as a random sample, what is the probability that the value of C 
is zero in the total population from which it was drawn? Before 
we can refer this question to a table of χ² (Appendix Table 2), 
we must have regard for the proper degrees of freedom. It will 
be recalled¹ that in each row and column of a contingency table 
(e.g., Table 59), one of the cell frequencies is not "free," because 
it may be determined by subtraction from the marginal totals. 
In any row or column, therefore, the number of free cell fre-
quencies, or degrees of freedom, is one less than the number of 
cells (columns or rows). In Table 59 there are three columns 
and six rows, so that the degrees of freedom for the whole table 
are (3 − 1)(6 − 1) = (2)(5) = 10. With 10 degrees of freedom, 
we find in Appendix Table 2 that a χ² as great as 23 would occur 
by chance only once in 100 trials. Since our χ² = 52 is still 
larger, we can be sure that the differences are not random. 
That is equivalent to saying that the value of C indicates a low 
but genuine association between types of criminals and 
intelligence. 

C is usually found from a shorter formula than that used above:² 

$$C = \sqrt{\frac{S - 1}{S}}, \tag{103}$$

where 

$$S = \sum\left(\frac{f_o^2}{n_i n_j}\right). \tag{104}$$

The value in parentheses in (104) is calculated for each cell of the 
table, and these cell values are summed over the table. Thus 
for the cell 80-89 in col. (6), 

$$\frac{f_o^2}{n_i n_j} = .0074.$$

Formula (103), however, does not provide a value of χ² by 
which to test the significance of the association found. 

1 See Chap. IX, p. 148. 
2 For the derivation of this formula, see Karl J. Holzinger, Statistical 
Methods for Students in Education, p. 275, Ginn and Company, Boston, 1928. 

The coefficient of contingency has the defect that it under-
states the amount of correlation actually present, in inverse pro-
portion to the number of cells in the table. For a 3 × 3 table 
having perfect correlation, C would not be 1.00, as it should, but 
.816; for a 5 × 5 table, the maximum value of C is .894; for a 
7 × 7 table, .926; for a 10 × 10 table, .949. Evidently C is not 
comparable between tables with different numbers of cells. 
For these reasons, it is well to apply C only to tables having 
say from 25 to 100 cells. 
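
The maximum values just quoted follow a simple rule that the text does not state: for a square table with k rows and k columns, the largest attainable C is the square root of (k − 1)/k. The short Python sketch below is an added illustration under that standard result; the function name `max_contingency` is hypothetical.

```python
from math import sqrt


def max_contingency(k):
    """Upper limit of C for a k x k table: sqrt((k - 1) / k)."""
    return sqrt((k - 1) / k)


# Table sizes mentioned in the text.
limits = {k: round(max_contingency(k), 3) for k in (3, 5, 7, 10)}
for k, limit in limits.items():
    print(f"{k} x {k} table: maximum C = {limit:.3f}")
```

The sketch returns .816, .894, .926, and .949, agreeing with the figures given above. The limit approaches 1.00 only slowly as the number of cells grows.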

It is possible to correct C to some extent for the above fault 
in cases where the correlation table has a fairly normal surface, as 
shown by the row and column totals in ordered series. For this 
purpose Table 60 may be used in connection with formula (105): 

$$\bar{C} = \frac{C}{t_r t_c}. \tag{105}$$

If for the moment we regard Table 59 as normal, we have from 
Table 60 for three columns t_c = .859, and for six rows t_r = .959, 
so that 

$$\bar{C} = \frac{.21}{(.959)(.859)} = .25.$$

TABLE 60.—FACTORS FOR CORRECTING C FOR BROAD GROUPING* 

Number of rows    Correction factor    Number of rows    Correction factor
or columns        (t_r, t_c)           or columns        (t_r, t_c)
      2                .798                  9                .981
      3                .859                 10                .985
      4                .915                 11                .987
      5                .943                 12                .989
      6                .959                 13                .991
      7                .970                 14                .992
      8                .976                 15                .993

* From C. C. Peters and W. R. Van Voorhis, Statistical Procedures and Their Mathe-
matical Bases, p. 398, McGraw-Hill Book Company, Inc., New York, 1940. 

The change in the value of C in this case is slight, and will 
always be so where the original value of C is low. The correction 
is therefore worth making only when the value of C is fairly 
high. Moreover, in the present case, one of the series in Table 
59 is unordered, so we are not justified in regarding it as approxi- 
mately normal in form, or in applying this correction to the C 
obtained from it. 
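
Purely to show the arithmetic of formula (105), and not to recommend the correction for Table 59, the following Python sketch looks up the two factors and divides. The dictionary `CORRECTION_FACTORS` copies the factors of Table 60 as read here, and the function name `corrected_c` is hypothetical.

```python
# Correction factors of Table 60, keyed by the number of rows or columns.
CORRECTION_FACTORS = {
    2: 0.798, 3: 0.859, 4: 0.915, 5: 0.943, 6: 0.959, 7: 0.970, 8: 0.976,
    9: 0.981, 10: 0.985, 11: 0.987, 12: 0.989, 13: 0.991, 14: 0.992, 15: 0.993,
}


def corrected_c(c, n_rows, n_cols):
    """Formula (105): divide C by the row and column correction factors."""
    t_r = CORRECTION_FACTORS[n_rows]
    t_c = CORRECTION_FACTORS[n_cols]
    return c / (t_r * t_c)


# Table 59 as grouped: six rows and three columns, with C = .21.
c_bar = corrected_c(0.21, n_rows=6, n_cols=3)
print(f"corrected C = {c_bar:.2f}")
```

The result is .25, as found above. Because the factors are all close to 1.00 for tables of moderate size, a low C is changed very little.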

A coefficient of contingency, C, needs perhaps even more 
careful interpretation than other coefficients of correlation. In 
the first place, it has no sign, so that its meaning is dependent 
upon an examination of the correlation table itself. When both 
series are ordered, it is possible to assign a sign to C; otherwise, 
not. In Table 59, the prisoner classification is unordered, so 
the C we found can have no sign. Notice also that the sizes 
of the χ² for the three classes of criminals are not comparable, 
because the number of prisoners is different in each class. We 
may, however, compute the mean I.Q. for each of the three 
classes, and in that way note how they compare in intelligence. 
Thus we find that property offenders are most intelligent, with 
an I.Q. of 79.96, while murderers and sex offenders are approxi- 
mately equal with I.Q.s of 75.71 and 74.51, respectively. If 
the categories were Life Sentence, Medium Sentence, Short 
Sentence, instead of Murderers, Sex Offenders, Property Offend- 
ers, the sign of C might be regarded as negative, since intelligence 
increases as the length of prison sentence decreases. If neither 
factor in the table was quantitative, means could not be com- 
puted. In that case, we could only compare the columns with 
respect to the proportions of their frequencies falling in each 
category of the stub. 

6. Correlation in Fourfold Tables.—Any scale may be divided 
into just two parts, or dichotomies. For example, we may meas- 
ure head lengths, and then classify heads below a certain length 
as short, and those of this length and above as long. Many 
sociological variables that have never been measured are com- 
monly treated as dichotomies, e.g., Cooperative, Not Coopera-
tive. Some information is gained if a more detailed breakdown 
is feasible, such as Completely Cooperative, Very Cooperative, 
Average Cooperative, Uncooperative, Completely Uncooperative. 

Some qualities are most conveniently regarded as attributes 
rather than as quantitative variables, and naturally take a 
dichotomous form. Examples are Violator of Parole, Non-
violator of Parole; White Race, Other Race. 

The measurement of the amount of relationship in a 2 × 2 
table is usually rather rough and inexact, regardless of what 
method is used. On this account, such a table is often merely 
tested for the presence of relationship, without attempting to 
measure it. The Chi-square test, explained in Chap. IX, is 
commonly relied on for this purpose. 

Suppose we are interested in whether or not there was any 
association between the occupation of agriculture and the 
tendency to commit crime in a given state over the period 1920-
1930. Table 61 gives all the information at hand bearing on the 
question, together with the scheme of symbols used in a short 
formula for χ² adapted to a 2 × 2 table. 


TABLE 61.—OCCUPATIONAL DISTRIBUTION OF THE ADULT MALE PRISON 
AND NONPRISON POPULATIONS OF A GIVEN STATE, 1920-1930 

Occupational        Mean prison       Mean nonprison      Total
classification      population        population
Agriculture         690 (u)           1,100,000 (v)       1,100,690 (n_3)
Nonagriculture      2,310 (w)         900,000 (x)         902,310 (n_4)
Total               3,000 (n_1)       2,000,000 (n_2)     2,003,000 (N)

$$\chi^2 = \frac{(ux - vw)^2 N}{n_1 n_2 n_3 n_4}. \tag{106}$$

Substituting in this formula, 

$$\chi^2 = \frac{[(690)(900{,}000) - (1{,}100{,}000)(2{,}310)]^2\,(2{,}003{,}000)}{(3{,}000)(2{,}000{,}000)(1{,}100{,}690)(902{,}310)},$$

$$\chi^2 = 1{,}239.$$


Entering Appendix Table 2 with one degree of freedom, as we 
did in Chap. IX, we see that so large a value of χ² would occur 
by chance much less often than once in 100 times. We may, 
therefore, regard the presence of association between the occupa-
tion of agriculture and the commitment of crime in Table 61 
as established beyond doubt. 
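
The short formula for χ² in a 2 × 2 table can be sketched in Python as below. This is an added illustration; the function name `chi_square_2x2` is hypothetical, and the letters u, v, w, x are used for the four cells in the arrangement of Table 61 (first row u, v; second row w, x).

```python
def chi_square_2x2(u, v, w, x):
    """Formula (106): chi^2 = (ux - vw)^2 * N / (n_1 * n_2 * n_3 * n_4)."""
    n1, n2 = u + w, v + x        # column totals
    n3, n4 = u + v, w + x        # row totals
    n = n1 + n2                  # grand total
    return (u * x - v * w) ** 2 * n / (n1 * n2 * n3 * n4)


# Cell frequencies of Table 61.
chi_square = chi_square_2x2(u=690, v=1_100_000, w=2_310, x=900_000)
print(f"chi-square = {chi_square:,.0f}")
```

The sketch gives χ² of about 1,239, the value obtained above. A table whose cells are exactly proportional to the marginal totals (ux = vw) would give zero.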
If it seems worth while to go farther than the χ² test, and try 
to estimate approximately the degree of association in a 2 × 2 
table, there are several coefficients available. They are based 
on different principles, however, and give different results. 
We shall illustrate three such coefficients, namely, Yule's Q, 
the ordinary coefficient of correlation adapted to fourfold tables, 
r_4, and the coefficient of tetrachoric correlation, r_t. Where one 
of them will not meet the needs of a particular problem, another 
usually will. 

The formula for Yule's Q is 

$$Q = \frac{ux - vw}{ux + vw}, \tag{107}$$

where the symbols refer to cell frequencies as shown in Table 
61. Let us apply it to the data of Table 61. Substituting, we 
have 

$$Q = \frac{(690)(900{,}000) - (1{,}100{,}000)(2{,}310)}{(690)(900{,}000) + (1{,}100{,}000)(2{,}310)},$$

$$Q = -.61.$$


According to this coefficient, there is a moderate amount of 
negative association between the occupation of agriculture and 
imprisonment for crime in Table 61, or, more generally, between 
the first column and the first row factors, when the positive 
and negative factors (e.g. Prison Population, Nonprison Popula- 
tion, Agriculture, Nonagriculture) are arranged as in the table. 
The result appears reasonable when it is noted that men usually 
engaged in agriculture formed only 690/3,000 = 0.23 of the 
prison population, but 1,100,000/2,000,000 = 0.55 of the non-
prison population. 
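
Formula (107) is easily checked by machine. The Python sketch below is an added illustration with a hypothetical function name, `yules_q`; the second and third calls use made-up cell frequencies simply to show the limiting values that Q takes when one cell is empty.

```python
def yules_q(u, v, w, x):
    """Formula (107): Q = (ux - vw) / (ux + vw)."""
    return (u * x - v * w) / (u * x + v * w)


# Cell frequencies of Table 61.
q = yules_q(u=690, v=1_100_000, w=2_310, x=900_000)
print(f"Q = {q:.2f}")

# Hypothetical tables with one empty cell, to show the limits of Q.
q_upper = yules_q(u=50, v=0, w=20, x=30)    # v = 0
q_lower = yules_q(u=0, v=40, w=20, x=30)    # u = 0
print(q_upper, q_lower)
```

For Table 61 the sketch gives Q = −.61. The two hypothetical tables give +1 and −1, although in each of them only one of the four cells is empty.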

Notice that Q = 0 if vw = ux, or if u/w = v/x; that Q = +1 
if v and/or w is 0; and that Q = −1 if u and/or x is 0. In other 
words, in Table 61, Q would show (1) zero association if the cell 
frequencies represented a purely random distribution of the 
table totals; (2) perfect positive association if all of the prison 
population, and/or none of the nonprison population was 
engaged in agriculture; (3) perfect negative association if none of 
the prison population, and/or all of the nonprison population 
was engaged in agriculture. The requirement for perfect asso-
ciation is less stringent than if "and/or" was replaced by "and" 
above, but Q is appropriate for treating the data of Table 61, if we 
are interested in the proportion of the prison population drawn 
from agriculture, as compared with the proportion of the non-
prison population drawn from agriculture. 

Should we want to measure the extent to which farmers and 
prisoners are strictly identical or exclusive categories, we may 
use the formula 

$$r_4 = \frac{ux - vw}{\sqrt{n_1 n_2 n_3 n_4}}, \tag{108}$$

which assumes v = w = 0 (Table 61) for perfect positive associa-
tion, u = x = 0 for perfect negative association, and (like Q) 
vw = ux for no association. For Table 61, 

$$r_4 = \frac{(690)(900{,}000) - (1{,}100{,}000)(2{,}310)}{\sqrt{(1{,}100{,}690)(902{,}310)(3{,}000)(2{,}000{,}000)}},$$

$$r_4 = -.025.$$


In view of the fact that the proportion of agriculturalists in 
the prison population was under half that in the nonprison 
population, the value of r_4 seems to be entirely too low, while 
the value of Q is about what would be expected. It seems extreme 
to insist that for perfect negative correlation the total nonagri-
cultural population, but not a single farmer, must be in prison, 
as the formula for r_4 requires. For other problems, however, 
r_4 may be more appropriate than Yule's Q. This suggests that 
the choice of a measure of correlation should be adapted to the 
particular problem and interest of the investigator. 
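
The contrast between the two coefficients can be seen by computing r_4 from formula (108) for the same table. The Python sketch below is an added illustration; the function name `fourfold_r` is hypothetical, and the second call uses made-up frequencies in which u and x are both zero.

```python
from math import sqrt


def fourfold_r(u, v, w, x):
    """Formula (108): r_4 = (ux - vw) / sqrt(n_1 * n_2 * n_3 * n_4)."""
    n1, n2 = u + w, v + x        # column totals
    n3, n4 = u + v, w + x        # row totals
    return (u * x - v * w) / sqrt(n1 * n2 * n3 * n4)


# Cell frequencies of Table 61.
r_4 = fourfold_r(u=690, v=1_100_000, w=2_310, x=900_000)
print(f"r_4 = {r_4:.3f}")

# Hypothetical table with u = x = 0: the only case giving r_4 = -1.
r_4_lower = fourfold_r(u=0, v=40, w=60, x=0)
print(r_4_lower)
```

For Table 61 the sketch gives r_4 = −.025, against Q = −.61 for the same frequencies. The difference reflects the stricter requirement that r_4 sets for perfect association.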
The two coefficients, Yule's Q and r_4, are both designed for the 
case where the frequencies are impressionistically divided 
into two groups, or, in geometric terms, roughly collected at two 
discrete points. In Table 61, these points are Agriculture and 
Nonagriculture for one factor, and Prison population and Non-
prison population for the other. 

When the frequencies are distributed along two quantitative 
scales, and on each scale they are divided into two groups by a 
point on the scale, and it is desired to find the amount of correla-
tion between the paired scale values rather than between the 
proportions of cases in the two dichotomies, the so-called tetra-
choric method is appropriate if the underlying mathematical 
assumptions mentioned below can be met. In Table 63, the one 
factor, size of household, is reduced to two classes from the 
quantitative distribution in Table 62; the other factor, Relief-
Nonrelief, is qualitative, like the categories of Table 61. 

There are difficulties in the computation of the tetrachoric 
coefficient, r_t, but an approximate formula is:¹ 


$$r_t = \frac{1}{hk}\left[1 - \sqrt{1 - \frac{2hk(vw - ux)}{N^2 HK}}\right], \tag{109}$$

where, reading from a table of normal areas and ordinates 
(Appendix Table 1), 

h is the x/σ value at .5 − n_3/N or .5 − n_4/N, 
k is the x/σ value at .5 − n_1/N or .5 − n_2/N, 
H is the height of the ordinate at h, 
K is the height of the ordinate at k, 
and the other symbols have the same meanings as in Table 61. 
The derivation of formula (109) assumes that both of the 
series (e.g., size of households and relief-nonrelief) are normally 
distributed, that both dichotomies are continuous, that the 
total frequency of the table is large, that the dichotomous divi-
sions are not made too far toward the extremes of their dis-
tributions, and that the relationship is linear. If the table is 
not normal, the value of r_t is affected by the point of division 
of the dichotomies, i.e., by whether each series is divided in the 
middle of the scale or at some other point. 

TABLE 62.—DISTRIBUTION OF RURAL RELIEF AND NONRELIEF HOUSEHOLDS 
BY SIZE, OCTOBER, 1933* 

                                     Households
Size of household          Relief     Nonrelief      Total
10 persons and over          290          246          536
9 persons                    202          213          415
8 persons                    353          336          689
7 persons                    493          560        1,053
6 persons                    633          997        1,630
5 persons                    834        1,322        2,156
4 persons                    846        2,061        2,907
3 persons                    846        2,408        3,254
2 persons                    745        2,430        3,175
1 person                     358          627          985
All households             5,600       11,200       16,800

* Adapted from Thomas C. McCormick, Comparative Study of Rural Relief and Non-
relief Households, p. 88, Research Monograph II, Works Progress Administration, Division 
of Social Research, Washington, D. C., 1935. Mid-point of last interval taken as 11. 

1 An alternative formula is 

$$r_t = \sin\left[\frac{\sqrt{ux} - \sqrt{vw}}{\sqrt{ux} + \sqrt{vw}}\cdot\frac{\pi}{2}\right],$$

where π = 180°, 
the symbols are arranged as in Table 63, and the sign of r_t is interpreted as in 
the case of Yule's Q above. 

In view of these restrictions, it hardly seems legitimate to 
apply r_t to Table 61 above. As Fig. 49 shows, the dichotomous 
line is drawn at the far upper end of the distribution of criminal-
ity, where the value of r_t is very sensitive to any skewness in the 
tail of the curve. 

Fig. 49.—Proportion of adult male population in prison, Table 61. 
(Horizontal scale runs from low criminality to high criminality.) 

TABLE 63.—NUMBER OF RURAL RELIEF AND NONRELIEF HOUSEHOLDS 
CONTAINING LESS THAN FOUR PERSONS, AND FOUR PERSONS AND 
OVER, OCTOBER, 1933* 

                                      Frequency
Size of household          Relief          Nonrelief        Total
4 persons and over         3,651 (u)       5,735 (v)        9,386 (n_3)
Less than 4 persons        1,949 (w)       5,465 (x)        7,414 (n_4)
All households             5,600 (n_1)     11,200 (n_2)     16,800 (N)

* Factors in the table should be so arranged that the value of the independent factor (size 
of household) increases from the bottom row to the top row, and the value of the dependent 
factor (economic independence) increases from the first column (relief) to the second column 
(nonrelief). 

An inspection of Table 62 suggests that the distribution by 
size of household is somewhat skewed. We can test this, how-
ever, by shifting the position of the dichotomous line, and noting 
the effect on the value of r_t. If r_t remains rather stable, it is 
evidence that the distribution is normal enough for the use of the 
tetrachoric method. There is no way to judge the normality 
of the dichotomous series, relief-nonrelief, except by rationaliza-
tion. If we believe that the families in the two groups together 
would form an approximately normal distribution when classified 
according to some index of dependency, say, per capita family 
income, we may be justified in proceeding. The size of household 
series is continuous (i.e., without gaps in the scale), and the 
same may be said of the relief-nonrelief series, although the 
nonrelief households were all nearest neighbors of the relief 
households, and represent a restricted nonrelief group. The 
dichotomous divisions in Table 63 do not fall too near the tails 
of their respective distributions. The condition that the total 
frequency should be large is also met. A rough test of linearity 
of relationship between the two series in the table could be made 
in the same way as was done in the case of biserial r (p. 200), but 
we shall not be risking a great deal if we dispense with it. 
Turning to the definitions of h, k, etc., above, we find 

$$\frac{n_3}{N} = \frac{9{,}386}{16{,}800} = 0.5587, \qquad \frac{n_2}{N} = \frac{11{,}200}{16{,}800} = 0.6667,$$

so that 
0.5 − 0.5587 = −0.0587, 
0.5 − 0.6667 = −0.1667. 

From Appendix Table 1 we read, 
Corresponding to the "Area" entry 0.0587, 
h = 0.14+. 
Corresponding to the "Area" entry 0.1667, 
k = 0.43+. 
And the corresponding ordinates, 
H = 0.395, 
K = 0.364. 
From Table 63 we have 
N = 16,800, 
v = 5,735, 
w = 1,949, 
u = 3,651, 
x = 5,465. 
Substituting in formula (109), 

$$r_t = \frac{1}{(.14)(.43)}\left[1 - \sqrt{1 - \frac{2(.14)(.43)\,[(5{,}735)(1{,}949) - (3{,}651)(5{,}465)]}{(16{,}800)^2(.395)(.364)}}\right],$$

$$r_t = -.21.$$
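
The substitution may be repeated by machine. The Python sketch below is an added illustration, not the author's procedure. It assumes that the approximate formula (109) has the form r_t = (1/hk)[1 − √(1 − 2hk(vw − ux)/(N²HK))], which is the reading that reproduces the value found above. The function name `tetrachoric_approx` and its argument names are hypothetical.

```python
from math import sqrt


def tetrachoric_approx(u, v, w, x, h, k, ord_h, ord_k):
    """Approximate r_t, assuming formula (109) has the form
    (1 / hk) * [1 - sqrt(1 - 2hk(vw - ux) / (N^2 * H * K))].

    h, k         -- deviates read from the table of normal areas
    ord_h, ord_k -- ordinates H and K at those deviates
    """
    n = u + v + w + x
    inner = 1.0 - 2.0 * h * k * (v * w - u * x) / (n ** 2 * ord_h * ord_k)
    return (1.0 - sqrt(inner)) / (h * k)


# Cell frequencies of Table 63 and the tabled values quoted in the text.
r_t = tetrachoric_approx(
    u=3651, v=5735, w=1949, x=5465,
    h=0.14, k=0.43, ord_h=0.395, ord_k=0.364,
)
print(f"r_t = {r_t:.2f}")
```

Under the stated assumption the sketch gives r_t = −.21. When vw = ux the quantity under the radical is 1 and the coefficient is zero, as it should be for a table showing no association.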


From Table 63, we see that 65 per cent of relief households 
have four or more persons, compared with only 51 per cent of 
nonrelief households. We therefore say that the degree of 
economic independence of a household is to a slight extent nega- 
tively correlated with the size of the household, as shown by the 
value r, 2 — 91. 

А quick method of finding the value of т; is provided by Г. 
Chesire, M. Saffir, and L. L. Thurstone's Computing Diagrams 
fer the Tetrachoric Correlation Coefficient. We shall use one of 
these diagrams (Fig. 50) to test the normality of the size-of- 
household Series in Table 62 by recomputing 7; after shifting the 
dichotomous line of division from three- to five-person house- 
holds, The new groupings are shown in Table 64. The fre- 
Quencies are reduced to proportions of the table total, 16,800, by 
multiplying the reciprocal of 16,800, 


Ш 


1 
а = {0 2, 
16,800 0.00005952, 
into each cell frequency. Theproportionsare entered in Table 65. 
* now take any row or column total that is not greater than 
500 as а, any other column or row total at right angles to it as b, 


T. 
са 64.— Момвев or Вовль RELIEF AND Nonrevier HOUSEHOLDS 
ONTAINING Luss THAN Sıx PERSONS, AND Srx PERSONS AND OVER, 
OCTOBER, 1933 


Frequency 
Size of household 
Relief Nonrelief Total 
G | M 
Pons and more, cs шу vats ce aot 1,971 | 2,352 | 4,328 


Persons and less. . 
‘Ouseholds 


3,629 8,848 | 12,477 
5,600 11,200 16,800 
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TABLE 65.—FREQUENCIES OF TABLE 64 REDUCED TO PROPORTIONS OF THE 
TABLE TOTAL, FOR Usp WITH ОнЕзїнЕ, SAFFIR, AND THURSTONE’S 
ComruTING DIAGRAMS 


Frequency 
Size of household 
Relief Nonrelief Total 
6 persons and more............ —.117 +.140 .257 
5 persons and less............. +.216 = с —.527 .743 = b 
All households................ .333 =a -667 1.000 


and the proportion in the cell common to the а row (or column) 
and the b column (or row), as c. One set of these letters is 
indicated in the table. From Fig. 50, the diagram for a = .33, 


eso} =. 22778 

Nus. 2 d 

„ГУ 72 En 
да 

0807022222 

012222 
FIG. 50.—Sample computing diagram for the tetrachoric correlation coefficient. 
(From L. Chesire, M. Saffir, and L. L. Thurstone, Computing Diagrams for the 
Tetrachoric Correlation Coefficient, University of Chicago Bookstore, Chicago, 
1933.) 
we find at the intersection of the orthogonal (right-angle) lines representing 
b = .74 and c = .22 a value r_t = −.23. If in Table 65 c falls 
in a positive quadrant, r_t has the sign shown in the diagram; but 
if c is in a negative quadrant, the sign indicated in the diagram 
is reversed. The signs of the quadrants are marked in Table 65, 
where it is seen that c is in a positive quadrant. Therefore, 
r_t = −.23, which agrees closely with the value of r_t computed for 
Table 63 with a different division of the dichotomy for size of 
households. So far as this test goes, then, the table seems to be 
normal enough to permit the use of the tetrachoric method. 
The test should be made for other points of division on the 
size-of-household scale, but would still be incomplete because 
new subdivisions cannot be tested in the relief-nonrelief series 
also.¹ 
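
The reduction of Table 64 to proportions and the lettering of a, b, and c can be checked by machine. The Python sketch below is an illustration only and is not the Chesire, Saffir, and Thurstone procedure: it assumes a bivariate normal surface and finds by trial the correlation at which that surface reproduces c. The function names (bvn_density, bvn_cdf, tetrachoric) and the numerical settings are assumptions made for this example. Because the diagram is read by eye from rounded entries, the computed value need not coincide exactly with the diagram reading.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

# Cell frequencies of Table 64: (relief, nonrelief) for each size class.
table_64 = {
    "6 persons and more": (1971, 2352),
    "5 persons and less": (3629, 8848),
}
total = sum(sum(row) for row in table_64.values())      # 16,800

# Proportions of the table total, lettered as in Table 65.
a = sum(row[0] for row in table_64.values()) / total    # relief column total
b = sum(table_64["5 persons and less"]) / total         # row total at right angles
c = table_64["5 persons and less"][0] / total           # cell common to a and b


def bvn_density(h, k, rho):
    """Height of the standard bivariate normal surface at (h, k)."""
    q = 1.0 - rho * rho
    return exp(-(h * h - 2.0 * rho * h * k + k * k) / (2.0 * q)) / (2.0 * pi * sqrt(q))


def bvn_cdf(h, k, rho, steps=2000):
    """P(X < h, Y < k): independence value plus the integral of the density
    over the correlation from 0 to rho (midpoint rule)."""
    nd = NormalDist()
    width = rho / steps
    area = sum(bvn_density(h, k, (i + 0.5) * width) for i in range(steps)) * width
    return nd.cdf(h) * nd.cdf(k) + area


def tetrachoric(a, b, c, tol=1e-6):
    """Correlation at which the normal surface gives proportion c (bisection)."""
    nd = NormalDist()
    h, k = nd.inv_cdf(a), nd.inv_cdf(b)
    low, high = -0.999, 0.999
    while high - low > tol:
        mid = (low + high) / 2.0
        if bvn_cdf(h, k, mid) < c:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


r_t = tetrachoric(a, b, c)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}, r_t = {r_t:.2f}")
```

A negative value is to be expected here, since c (.216) is smaller than the product ab (about .248) that independence would give.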


It should finally be observed that a fourfold correlation table 
includes some of the basic elements of experimental design. 
Thus, in Table 61, we have an independent factor or treatment, 
Agriculture; a dependent factor, Imprisonment for Crime; an 
experimental group, the Prison population; and a control group, 
the Nonprison population. On the other hand, dichotomies are 
used instead of classes based on measurement. In Table 61, 
sex and (roughly) age have been held constant, and there is 
nothing in the method that precludes as rigorous factor control 
as seems worth while. Even the broad 2 × 2 table may, therefore, 
be a valuable analytical device. 


1 If it is needed to determine the value of the tetrachoric coefficient, r_t, 
very precisely, the complete formula may be seen in several texts, e.g., 
Davenport and Ekas, Statistical Methods in Biology, Medicine and Psychology, 
4th ed., pp. 105-106, or Peters and Van Voorhis, Statistical Procedures 
and Their Mathematical Bases, p. 370; and helpful tables with explanations 
are given by Karl Pearson, Tables for Statisticians and Biometricians, 3d ed., 
Part I, pp. xxxvi, xliii, l, liii, 31, 32, 33, 34, 42-52, 52-57; Part II, pp. xliv, 
73, 74. Formulas have been derived for the standard errors (see Chap. XII) 
of biserial r, the coefficient of tetrachoric correlation, and the coefficient of 
contingency. The standard error of the coefficient of contingency, C, is 
hardly needed if the value of χ² for the contingency table is referred to a 
table of χ², as was done above in the section on this coefficient. The formula 
may be seen, however, in such texts as Holzinger, Statistical Methods for 
Students in Education, p. 278. The standard error of the tetrachoric correlation 
coefficient, r_t, is also given by Davenport and Ekas, op. cit., p. 108, 
and by Peters and Van Voorhis, op. cit., p. 371. G. U. Yule and M. G. 
Kendall, An Introduction to the Theory of Statistics, p. 408, show the formula 
for the standard error of r_s. Somewhat simpler are the standard error 
formulas of r_bis and Q: 

\[ \sigma_{r_{bis}} = \frac{\dfrac{\sqrt{pq}}{z} - r_{bis}^{2}}{\sqrt{N}} \tag{111} \]

\[ \sigma_{Q} = \frac{1 - Q^{2}}{2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \tag{112} \]

Exercises 


1. What are the chief disadvantages in correlating qualitative series, 
as compared with quantitative series? 

2. What preliminary test should be made of a qualitative table 
before applying correlation to it? 

3. What is the amount of correlation between type of college training 
and success in teaching in the following table? 


TWO HUNDRED HIGH SCHOOL TEACHERS CLASSIFIED BY TYPE OF COLLEGE 
FROM WHICH THEY GRADUATED, AND BY SUCCESS IN TEACHING 


Institution                   Successful | Unsuccessful | Total 
Teachers college ..........                               100 
University or college .....                               100 


Defend your choice of a coefficient, and explain the meaning of your 
results. 

4. How much association, if any, is there between the sex of distin- 
guished people and the socioeconomic class of their fathers in the 
table below? 


FAMOUS BRITISH MEN AND WOMEN CLASSIFIED BY SOCIAL ORIGIN* 


Socioeconomic class of father      Men    Women 
Nobleman ....................    1,059      108 
Gentleman ...................      724       83 
Politician, lawyer ..........      666       61 
Soldier, sailor .............      490       53 
[illegible] .................    1,100       57 
Teacher .....................      274       28 
Physician ...................      396       35 
Administrator ...............      194       12 
Writer, artist ..............      371      109 
Businessman .................      929       95 
[illegible] .................      446       38 
Laborer, servant ............       81        8 
Agriculture .................      270       18 

Total .......................    7,000      700 


* Adapted from Table 4, p. 708, Joseph Schneider, Class Origin and Fame: Eminent 
English Women, American Sociological Review, Vol. 5, pp. 700-713, 1940. 
Is the association positive or negative? What does the association 
mean in terms of this problem? What should be done about the correc- 
tion for broad grouping in this case? Is the value of C significantly 
greater than zero? Explain what this means. 

5. What is the amount of association between the sex of a sample of 
undergraduate students at the University of Wisconsin in 1938-1939 
and their state of residence? What coefficient is most appropriate to 
this problem, and why? Interpret its meaning. 


A SAMPLE OF UNDERGRADUATE STUDENTS, UNIVERSITY OF WISCONSIN, 
1938-1939, CLASSIFIED BY SEX AND BY STATE OF RESIDENCE 


State of residence      Male | Female | Total 
Wisconsin ...........     94     44      138 
Other states ........     17     27       44 
Total ...............    111     71      182 


6. Find the amount of association between type of offense and body 
build in the table: 


CRIMINALS CLASSIFIED BY TYPE OF OFFENSE AND BODY BUILD* 


* From E. A. Hooton, The American Criminal, Vol. I, Appendix, IX-8, Harvard University 
Press, Cambridge, 1939. 


Is the value of the coefficient significantly greater than zero? What 
does the coefficient mean here? 

7. What is the amount of correlation between the age distributions 
of females in the neighboring urban and rural counties in the accompanying 
table of age distributions? 

AGE DISTRIBUTIONS OF FEMALES IN A RURAL (RUTHERFORD) AND A NEAR-BY 
URBAN (MECKLENBURG) COUNTY IN NORTH CAROLINA, 1930* 


Age, years           Rural county | Urban county 

Under 5 ..........       2,553        6,542 
5-9 ..............       2,846        7,311 
10-14 ............       2,428        6,424 
15-19 ............       2,247        6,751 
20-24 ............       2,109        7,862 
25-29 ............       1,579        6,990 
30-34 ............       1,201        5,277 
35-44 ............       2,202        8,288 
45-54 ............       1,520        5,199 
55-64 ............         932        2,548 
65-74 ............         515        1,842 
75 and over ......         237          586 
Total ............      20,369       65,120 


* From Fifteenth Census of the United States, 1930, Bureau of the Census. 
CHAPTER XII 
SAMPLING AND SAMPLING ERRORS 


1. Definitions.—In sociological research, it is seldom possible 
to study more than a part of the whole, or universe,¹ in which 
we are interested. For example, if it is wanted to know whether 
the educated or the uneducated in the United States have the 
higher birth rate, it would be impractical to find the birth rate 
of the millions in each class. A sample would have to be taken 
of each group, and the birth rates of the two samples compared. 
If the samples were large and properly taken, the sample birth 
rates should be rather close to the true rates for the total edu- 
cated and uneducated in the country. 

A value (e.g., a mean) found from a sample is called a statistic, 
whereas the corresponding true or expected value in the universe 
is called a parameter. The primary purpose of all sampling is 
to learn something about a universe, often to estimate the value 
of a parameter from the value of a statistic. There is seldom 
any interest in a sample or in the value of a statistic for its own 
sake. A good sample is, therefore, one that yields reliable 
information about a universe. 
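
The distinction between a parameter and a statistic may be illustrated by a small artificial experiment. The sketch below is hypothetical throughout: the universe of 100,000 couples, the proportion with a birth, the sample size, and the seed are all assumptions chosen for illustration, not data from any study.

```python
import random

random.seed(1939)  # fixed seed so that the illustration is repeatable

# Hypothetical universe: 100,000 couples; 1 = a birth during the year, 0 = none.
universe = [1] * 12_000 + [0] * 88_000
parameter = sum(universe) / len(universe)   # true proportion in the universe

sample = random.sample(universe, 2_000)     # random sample, without replacement
statistic = sum(sample) / len(sample)       # proportion found in the sample

print(f"parameter = {parameter:.4f}, statistic = {statistic:.4f}")
```

With a sample of this size, properly taken, the statistic should fall rather close to the parameter, though it will seldom equal it exactly.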

The first step in sampling is to define the universe to be 
sampled. Thus we might define the universe of the educated 
as consisting of all married couples in the United States living 
together through the year 1939 who had successfully passed at 
least the first year of high school; and the universe of the unedu- 
cated as corresponding couples who had less schooling than 
this, the birth rates to be compared as of the year 1939. The 
sociological universe should usually be defined in both space 
and time. 

Since the universe is made up of events, a definition of the 
event is also necessary. In our illustration above, the event is a 
married couple with a birth or a married couple without a birth 
during 1939. There are thus two kinds of events, couples with 
a birth, which may be called successes, and couples without a 
birth, which may be called failures. The word “success” 
merely designates that particular event among two or more 
different kinds of events in which the investigator is chiefly 
interested. If we were sampling farmers to find their net annual 
incomes in 1939, the event would be a farmer’s income, and would 
represent a continuous, measured variable. In the case of a 
measured variable, there is, of course, no dichotomy of success 
and failure, but merely a number of specific values. 

1 Synonymous terms often used are population and parent. 

A universe may consist of an infinite or of a finite (limited) 
number of events. If the number of events is very large, the 
universe may be regarded as infinite for practical purposes. 

The events in a universe may already have happened, or may 
be yet to happen. In the former case they are said to be existent; 
in the latter, hypothetical. In our illustration above, at the 
beginning of the year 1939 none of the events (a birth or 
the absence of a birth to a married couple) has happened; at 
the end of the year, all of them have happened. Similarly, heads 
or tails is a hypothetical event before tossing a penny, an existent 
event after tossing. When the universe to be sampled consists 
entirely of completed events, the universe is said to be exist- 
ent; when it consists entirely or partly of events yet to come, 
it is said to be hypothetical. Prediction, with which social 
science must be concerned, is of course possible only with 
respect to hypothetical universes, since we do not “predict” 
past events. 

It is also important to notice whether the universe to be 
sampled is to be regarded as a unique, historical set of events 
(situation), as a constant or recurrent situation or system of 
causes, or as a changing situation. If we are interested in the 
death rate from the influenza epidemic of 1918, we have a unique 
universe. But if we attempt to predict the rate of mortality 
in Chicago, we assume a continuous or recurrent, i.e., essentially 
unchanging, universe. As a matter of fact, strictly continuous 
or recurrent universes never occur in social research, since there 
is constant change in the complex of factors that compose any 
social situation. The important question, therefore, is whether 
the universe can be expected to be approximately recurrent, or 
unchanged, over a period in which we are interested. If so, we 
may be justified in trying to predict what will happen in that 
period on the strength of what has occurred. It is sometimes 
possible to discover the nature, direction, and rate of change in 
a changing universe, so that we can allow for it in making a 
prediction. 

Finally, we shall find it worth while to distinguish between 
homogeneous and heterogeneous hypothetical universes. A uni- 
verse is homogeneous when each hypothetical event has the 
same a priori probability of becoming a success or a specified 
value of a variable; it is heterogeneous when this probability 
is not the same for each hypothetical event. A homogeneous 
universe derives from a single set or system of causes, a heteroge- 
neous universe from two or more distinct sets of causes, as judged 
by their effects on the hypothetical events in which we are inter- 
ested. When an insurance company sets up a class of “risks,” 
composed of, say, males, native white, married, in the legal 
profession, aged 25, class “A” medical examination, living in 
Michigan, the company is trying to create a homogeneous 
universe. Every person or hypothetical event admitted to the 
class must be judged alike in respect to certain characteristics 
that are believed to be related to the event, death. In other 
words, each member of the class must have the same apparent 
chance of death. In this way, and by requiring that the condi- 
tions of life for the class must go on essentially in the future 
as in the past (e.g., in case one of the insured persons enlists in a 
war, his contract may be modified or canceled), some likelihood 
is created that the system of causes affecting the mortality rate 
of the class will continue each year about the same as the year 
before, except for chance factors. If, however, a number of 
men aged 65 were to be admitted to the risk class originally 
composed of men aged 25, heterogeneity in the hypothetical 
events would at once be introduced. While such a mixed or 
heterogeneous universe might be recurrent if the proportion 
of the two ages were kept constant, it could no longer claim to 
be homogeneous, because the chance of death is known to be 
different for a man aged 25 and a man aged 65. In practice, 
just when a hypothetical universe may be considered homogeneous 
is a matter of information and of degree. Of course, no 
two persons in a life-table category actually have exactly the 
same chance of death. The more completely the causes that 
are related to the success are controlled and equated from event 
to event, however, the more accurate and reliable the prediction 
from the sample will tend to be, within limits. Where to stop 
the effort to increase homogeneity is a question of judgment and 
expediency. The more homogeneous the categories of any 
classification are made, the greater their number, and the fewer 
the events that will fall in any one category. While it is usually 
advisable to sacrifice the size of the sample for the sake of homo- 
geneity to a certain point, diminishing returns set in if the idea 
is carried too far. 

2. Taking the Sample.—The events in a sample may be drawn 
from the universe (1) at random, (2) at regular intervals, (3) 
at random from different strata or subclasses of the universe, or 
(4) according to some purposive scheme, such as from the middle 
and ends of a distribution. Thus, (1) we might draw marriage 
certificates at random from an alphabetical list of all of the 
marriage certificates in a file, (2) we might take every fifth 
certificate in order, (3) we might draw a proportional number of 
certificates at random from each separate county and city list, or 
(4) we might take certificates from the top, middle, and bottom 
of the list. The most common method of taking a sample, and 
the one to which most of the statistical theory of sampling applies, 
is the random. The method of sampling at random propor- 
tionally from within strata—e.g., marriage certificates taken at 
random from each county file—is more representative than 
random sampling from the total universe—e.g., certificates 
taken from a grand list. Unfortunately, however, the sampling 
errors of only a few statistics are available in the case of stratified 
sampling. Purposive sampling is seldom as reliable as either 
of the other two methods, and difficulties of determining sampling 
errors are encountered. We shall deal here primarily with 
random sampling, but shall introduce stratified sampling for a 
mean and for a proportion. 
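
The four ways of drawing a sample described above may be sketched in Python. The file of 1,000 certificates, the three counties A, B, and C, their sizes, and the sample size of 100 are hypothetical values assumed only for this illustration.

```python
import random

random.seed(7)  # fixed seed so that the illustration is repeatable

# Hypothetical file of 1,000 certificates: (serial number, county).
counties = ["A"] * 500 + ["B"] * 300 + ["C"] * 200
certificates = list(enumerate(counties, start=1))
n = 100

# (1) At random: every certificate has the same chance at each draw.
random_sample = random.sample(certificates, n)

# (2) At regular intervals: every k-th certificate after a random start.
k = len(certificates) // n
start = random.randrange(k)
interval_sample = certificates[start::k]

# (3) At random within strata: a proportional number from each county.
stratified_sample = []
for county in ("A", "B", "C"):
    stratum = [cert for cert in certificates if cert[1] == county]
    quota = round(n * len(stratum) / len(certificates))
    stratified_sample.extend(random.sample(stratum, quota))

# (4) Purposive: certificates from the top, middle, and bottom of the list.
third = n // 3
mid_start = len(certificates) // 2 - third // 2
purposive_sample = (certificates[:third]
                    + certificates[mid_start:mid_start + third]
                    + certificates[-(n - 2 * third):])

for name, s in [("random", random_sample), ("interval", interval_sample),
                ("stratified", stratified_sample), ("purposive", purposive_sample)]:
    print(name, len(s))
```

Only the stratified plan guarantees that each county appears in the sample in exactly its proportion of the file.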

A sample is random when at any given draw or trial, considered 
alone, every existent event has an equal chance of being taken, or 
every hypothetical event is equally likely to occur. In other 
words, in a random sample the chance of being “drawn” or 
“thrown” is independent of the character of the event. In 
addition, a simple sample, also called a Bernoulli sample after 
the Swiss mathematician who studied it, requires that the 
probability of drawing or throwing a success or a specified value 
shall remain the same from one draw or trial to another.¹ It is 
theoretically possible to random sample any universe, but a 
simple sample can be drawn only from an infinite universe. 
Suppose we have an existent universe of 1,000 marriage cer- 
tificates and wish to take a random sample of 100. Suppose, 
further, that 60 of these marriages have ended in divorce. At 
the first draw, the probability of taking a marriage that has 
ended in divorce is 60/1,000, or 0.060. If we happen to draw a 
divorced marriage at the first trial, the probability of getting 
a divorced marriage at the second draw will be 59/999, or 0.059; 
otherwise it will be 60/999, or 0.06006. Now if the certificates are 
drawn entirely independently of the question of divorce, at any 
draw one certificate will have as good a chance of being taken 
as any other in the file, and the resulting sample will be random. 
But because the probability of drawing a given event, say, a 
divorced marriage, changes from one draw to the next in the 
limited universe of 1,000 certificates, the sample will not be 
simple. If, however, after each draw of a certificate from 
the file of 1,000, its number is recorded in the sample and the 
certificate is returned to the file, the number of certificates in 
the file will remain constant and the probability of drawing a 
divorced marriage will not change from draw to draw. By the 
act of replacement the universe becomes infinite. Of course, if 
we happen to draw the same certificate more than once, it will 
have to be accepted in the sample each time it is drawn, if it 
is wanted to maintain an infinite universe. 
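
The arithmetic of this illustration may be verified with exact fractions. The sketch below uses only the figures given in the text (1,000 certificates, 60 of them divorced); the variable names are assumptions made for the example.

```python
from fractions import Fraction

certificates, divorced = 1000, 60

# Without replacement the probability changes after the first draw.
p_first = Fraction(divorced, certificates)
p_second_after_divorce = Fraction(divorced - 1, certificates - 1)
p_second_after_other = Fraction(divorced, certificates - 1)

# With replacement the file is restored, so the probability stays constant.
p_second_with_replacement = Fraction(divorced, certificates)

print(float(p_first))                    # first draw
print(float(p_second_after_divorce))     # second draw, first was divorced
print(float(p_second_after_other))       # second draw, first was not divorced
print(float(p_second_with_replacement))  # second draw, with replacement
```

The first three values differ from draw to draw, which is why the sample is random but not simple; the last equals the first, which is the condition of simple sampling.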

In the case of a hypothetical universe, a simple sample of 
hypothetical events can be drawn only if the universe is homo- 
geneous, like a life insurance risk group. All the causes that 
determine the chance of death that the actuaries have been 
able to consider must be the same for each individual in the 
group. If a random sample were drawn from a mixture of 
two different risk groups, so that, say, persons of different 
sexes were included in the sample, the chance of death would 
not be the same from one hypothetical event to another (person 
to person), and the sample would be further removed from a 
simple sample. 

1 But it is not assumed that the probability is the same for different kinds 
of events or different values, e.g., a person of age 25 and a person of age 30 
in a universe of ages. 

It follows from the preceding definitions that a simple or 
random sample of an existent universe is not necessarily a 
simple or random sample of the hypothetical universe from 
which the existent universe was derived. But when the existent 
universe itself is a simple or random sample of a hypothetical 
universe, a simple random sample of the existent universe will 
be a simple or random sample of that hypothetical universe also. 

In practice, it is not easy to obtain a random sample. The 
most manageable case is that of a limited existent universe each 
of whose events can be individually identified, such as the list 
of marriage certificates mentioned above. If we take certificates 
or pages of certificates at regular intervals from an alphabetical 
or other random list, say every twentieth certificate or page, the 
first page being chosen at random, the sample should apparently 
be random, because there is no obvious connection between this 
order and the information on the marriage certificates. If 
the interval is not too large, this method should also be more 
representative than other types of random sampling, since 
it takes certificates proportionately from every part of the 
list. There are many other devices for taking a random sample 
from a list. One of the commonest is to number the items, 
place corresponding numbers on tickets in a box, shuffle them, 
and draw. Experience has shown, however, that methods 
like the above will not always yield a random sample. Mathe- 
matically, the ideal plan is to draw the sample from a table of 
random numbers, such as L. H. C. Tippett’s Random Sampling 
Numbers,¹ which are combinations of digits taken at random 
from census reports. A specimen page of these numbers is 
shown below (Fig. 51). 

Imagine that we wish to take a random sample of 200 marriage 
certificates from a list of certificates in a state file. The cer- 
tificates are filed and numbered consecutively, so that any nth 
certificate from the beginning of the list can be quickly located. 
The smallest number printed in the table is 0,000 and the largest 
is 9,999. If the number of events in the universe is close to 
10,000, we can simply go down each column of four figures in 
the table, taking for our sample the first 200 numbers within 
the range of our universe that we meet. When the universe is 


1 Tracts for Computers, Number XV, Cambridge University Press, London, 
1927. 
much smaller than 10,000, it sometimes saves time to assign 
several table numbers to each event. For example, if there are 
only 2,000 events (e.g., marriage certificates) in the universe, we 
may assign to event number 1 the table numbers 0 through 4, 
to event number 2 the table numbers 5 through 9, and so on. 
If then we read, say, the number 0061 from the table, we draw 
event number 13 from the universe (60/5 + 1 = 13).¹ The same 
event is accepted only once from a limited universe, regardless 
of how many times it may be drawn. Also, the number of digits 
to read in the table may correspond to the number of digits 
needed to express the total of events in the universe. Thus, if 
the universe contains 800 events, we may draw three-digit 
numbers, e.g., 295, 016, 273, and so on; if the number of events is 
500,000, we may draw six-digit numbers, such as 295,266; 416,795; 
003,074. The table may be read in any direction or order. 
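
The rule for assigning several table numbers to each event may be written out as a short routine. In the sketch below the function names event_number and draw_sample, and the four table numbers read in the example, are assumptions for illustration; they are not taken from Tippett's table.

```python
def event_number(table_number, numbers_per_event=5):
    """Serial number of the event to which a table number is assigned.

    With five table numbers per event, 0-4 go to event 1, 5-9 to event 2,
    and so on, so the serial number is the whole quotient plus one.
    """
    return table_number // numbers_per_event + 1


def draw_sample(table_numbers, universe_size, sample_size, numbers_per_event=5):
    """Read table numbers in order, accepting each event only once."""
    chosen = []
    for number in table_numbers:
        event = event_number(number, numbers_per_event)
        if event <= universe_size and event not in chosen:
            chosen.append(event)
        if len(chosen) == sample_size:
            break
    return chosen


# Hypothetical readings from a table of four-digit random numbers.
readings = [61, 63, 5, 9999]
print(event_number(61))                  # the text's example: 0061 gives event 13
print(draw_sample(readings, 2000, 3))    # 0063 repeats event 13 and is passed over
```

The whole-number division does the work of reducing the table number to the multiple of five just below it before dividing.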

When the individual events of a limited universe cannot be 
identified and labeled, probably the next best thing is to identify 
groups of them, usually on a geographical-time basis. Thus a 
random sample of the farmers of a state, of whom there is no list, 
may be obtained by numbering each township in the state and 
drawing a random sample of townships by one of the methods 
suggested above. Then each township drawn in the sample 
may be visited at a given date, and all the farmers in it taken in 
the sample. Or, if necessary, the sample townships may be 
divided into school districts and a random sample of these dis- 
tricts drawn before going to the events (farmers) themselves. 
When this approach has to be used, the number of groups com- 
posing the universe should be as large as practicable, while the 
number of events in each group should be a minimum and 
nearly equal from group to group. For example, if the township 
is the smallest unit for which data are available, a township with 
a large population may be subdivided and represented by two or 
three tickets, instead of by one, in the drawing, so that the 
probability of drawing a township will be roughly proportional 
to the size of its population. 
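
The device of giving a populous township more than one ticket may be sketched as follows. The township names, their farm populations, and the allowance of roughly one ticket per 200 farmers are all hypothetical assumptions for this illustration.

```python
import random

random.seed(3)  # fixed seed so that the illustration is repeatable

# Hypothetical townships and their numbers of farmers.
townships = {"Ash": 210, "Birch": 190, "Cedar": 610, "Dale": 405}
farmers_per_ticket = 200  # assumed allowance: about one ticket per 200 farmers

# Each township receives tickets roughly in proportion to its population,
# but never fewer than one.
tickets = []
for name, population in townships.items():
    tickets.extend([name] * max(1, round(population / farmers_per_ticket)))

drawn = random.choice(tickets)  # one draw from the box of tickets
print(tickets)
print("township drawn:", drawn)
```

Here the largest township holds three of the seven tickets, so that its chance of being drawn is about three times that of the smallest.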

1 In this case, to find the serial number, if the table number is not already 
an exact multiple of five, reduce it to the next lower multiple of five, divide by 
five, and add one. 

In dealing with an infinite or a very large universe, it is of 
course not possible to list and label all the individual events, but 
it may be feasible to use the group method mentioned in the 
preceding paragraph. For example, if a physical anthropologist 
wanted to sample the white race, he might divide the countries 
occupied by the various branches of this race into small geo- 
graphical areas, number them, and draw them at random. He 
would then probably have to go to each of the areas drawn, 
further subdivide them, draw a random sample of the small 
subdivisions, and then finally perhaps take a random sample of 
the individuals living in each subdivision. 

Such a plan as the above, however, is not adapted to a hypo- 
thetical universe, like the number of heads or tails that might be 
thrown with a penny, or the number of divorces that might occur 
in the United States over some future period of time. The only 
way to draw a random sample in this case is to define a set of 
conditions, or causal system (e.g., social conditions in the United 
States, Jan. 1, 1940 to Jan. 1, 1941), draw at random a number 
of hypothetical events that satisfy the conditions (couples 
married on Jan. 1, 1940), and let the system act to convert them 
to existent events (couples divorced, not divorced on Jan. 1, 
1941); or else wait until the system has produced a large number 
of existent events (couples married Jan. 1, 1940, after Jan. 1, 
1941), and then draw at random as many of them as are needed 
for the sample. In either case, if a simple sample is wanted, it 
is, of course, necessary to make sure that the existent events 
(couples divorced or not divorced Jan. 1, 1941) were derived 
from hypothetical events (couples married Jan. 1, 1940), each of 
which had (on Jan. 1, 1940) essentially the same a priori prob- 
ability of becoming a success (divorced couple) throughout 
the experiment (Jan. 1, 1940 to Jan. 1, 1941), except for chance 
factors. Notice also that a causal system that does not act 
uniformly over its time cycle must furnish sample events from 
the whole of its cycle, to avoid important omissions. For exam- 
ple, in determining the death rate of infants during the first year 
of life, observations should extend over the complete period of 
12 months, because the death rate is subject to seasonal variation. 

If a heterogeneous hypothetical universe, i.e., a hypothetical 
universe in which the chance of success is not the same from one 
hypothetical event to another (e.g., a class of life insurance risks 
of different ages, where p is the probability of an individual's 
death within the year, and a hypothetical event is a person taken 
at the beginning of the year), exists without important change 
over a period of time, then a random sample drawn from a large 
number of the events at the end of this period will yield an esti- 
mate of the mean probability of death for the mixed class. 

When a universe is divided into strata with respect to some 
trait, a proportional¹ simple subsample is taken from each stra- 
tum, and these subsamples are combined, the resulting total 
sample is called a Poisson sample, in honor of the French mathe- 
matician who described it. Thus, for the purpose of drawing 
a Poisson sample, an existent universe of family incomes in New 
York City in 1939 may be divided into the classes: Under $500, 
$500-$999, $1,000-$1,499, . . . ; or, supposing that we are 
interested in divorce, we may define a hypothetical universe of 
ever-married women in the city of Philadelphia on Jan. 1, 1940, 
consisting of subgroups whose members are alike in respect to 
occupation of husband, presence or absence of children, religious 
affiliation, length of time since marriage, and so on. If all the 
requirements of Poisson sampling are to be met, each stratum 
must constitute an infinite subuniverse. In the case of our 
existent universe of family incomes, this will be approximately 
true if the number of incomes in each class is very large. Any 
hypothetical universe or stratum may be regarded as infinite on 
the assumption that the defined set of conditions theoretically 
acts to produce events without limit. For example, it may be 
reasoned that the conditions that produced a certain percentage 
of divorces among a group of Philadelphia women whose hus- 
bands were skilled laborers, who had borne children, who were 
Protestants, who had been married five years, and so on, might 
continue indefinitely to produce the same percentage of divorces 
(except for random errors) among women of this description. 
As a rule, however, it is more realistic to consider how long we 
may expect a hypothetical universe actually to persist without 
important change, and then decide whether the probable num- 
ber of events that will be produced in a given stratum within 
that period may be regarded as infinite for practical purposes. 
To refer again to our illustration, we might conclude that the 
set of conditions responsible for the divorce rate observed in the 
class of women defined above would probably remain essentially 
the same no longer than perhaps a decade, but that in 10 years 
several thousand women would come within the class, a number 
great enough to be regarded as infinite without noticeable error. 

1 Preferably also weighted by the value of the standard deviation of the 
stratum or subgroup. 

If a simple sample is drawn from only one of several strata 
forming a universe, and from it an attempt is made to judge the 
whole universe, the sample is called a Lexis sample, after a Ger- 
man statistician. The sampling error of a Poisson sample is 
less than that of a random sample, while that of a Lexis sample is 
greater. A Lexis sample is seldom taken intentionally, but may 
occur when some important part of the universe is omitted from 
the sample.¹ The Poisson sample, on the other hand, is the 
most representative sample that can be taken of sociological 
data, and should be used much more than it now is. 
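
The relative sizes of the sampling errors of Poisson, random, and Lexis samples may be illustrated by an artificial sampling experiment. In the sketch below the two strata, their shares of the universe, their chances of success, the sample size, and the number of repetitions are hypothetical assumptions; each stratum is treated as an infinite subuniverse, and the Lexis case is represented by drawing each whole sample from a single stratum.

```python
import random
from statistics import pstdev

random.seed(11)  # fixed seed so that the illustration is repeatable

# Hypothetical strata: (share of the universe, chance of success).
strata = [(0.6, 0.10), (0.4, 0.50)]
weights = [share for share, _ in strata]
n, repetitions = 100, 2000


def successes(p, size):
    """Number of successes in `size` independent trials with chance p."""
    return sum(random.random() < p for _ in range(size))


def random_sample():
    """Each event is drawn from the whole universe without regard to stratum."""
    hits = 0
    for _ in range(n):
        _, p = random.choices(strata, weights=weights)[0]
        hits += random.random() < p
    return hits / n


def poisson_sample():
    """A proportional simple subsample from every stratum, then combined."""
    return sum(successes(p, round(share * n)) for share, p in strata) / n


def lexis_sample():
    """The whole sample drawn from one stratum only."""
    _, p = random.choices(strata, weights=weights)[0]
    return successes(p, n) / n


sd_random = pstdev(random_sample() for _ in range(repetitions))
sd_poisson = pstdev(poisson_sample() for _ in range(repetitions))
sd_lexis = pstdev(lexis_sample() for _ in range(repetitions))
print(f"Poisson {sd_poisson:.4f} < random {sd_random:.4f} < Lexis {sd_lexis:.4f}")
```

The more the strata differ in their chances of success, the greater the advantage of the Poisson sample over the random sample, and the greater the penalty of the Lexis sample.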

What has been said above about the sampling of an attribute 
(an unmeasured quality called an event, such as the survival or 
death of an insured person) applies equally to the sampling of a 
variable (a measured quality, such as the net annual income of a 
farm family). In the case of sampling a variable, the parameter 
in which we are usually interested is the mean of the values in the 
universe (e.g., the mean net annual income of the farmers in 
Nebraska), although it may be the standard deviation, a cor- 
relation coefficient, or other index. 

When the purpose in taking a sample is to use the proportion, 
mean, or other statistic from the sample as an estimate of the 
corresponding parameter in the universe, we need to know the 
range of error in the estimate due to sampling. This can be 
found only if the sample is approximately of some standard type, 
such as random, simple, or Poisson. Thus, if we find from a 
simple sample of juvenile delinquents that 21 per cent were 
from broken homes, we are able to estimate with the aid of the 
mathematical theory of sampling that the chances are, say, 19 
to 1 that certain limits, say 15 to 27, will enclose the true percentage 
from broken homes in the universe from which the sample 
was taken. If the nature of the sample is uncertain, so that we do 
not know that it is, say, simple or Poisson, we cannot apply the 
appropriate formulas for finding the errors of sampling, and so 
cannot gauge the amount of error in any statistic estimated from 
the sample. The chief assurance that we can have about the 
nature of a sample must come from a knowledge of the method 
by which it was taken. Thus, we must know that the conditions 
of, say, simple sampling were at least broadly met in drawing the 
sample before we can safely treat it as a simple sample. 

1 See Bruce D. Mudgett and S. R. Gevorkiantz, Reliability of Forest 
Surveys, Journal of the American Statistical Association, Vol. 29, pp. 257-281, 
1934. 

3. The General Theory of Sampling.—In general, the theory 
of sampling that provides a basis for the measurement of sampling 
error is as follows. Suppose that we draw a large number, N, 
of random samples of equal size, n, from a universe of juvenile 
delinquents, and list the number of delinquents with broken 
homes (successes) in each sample. We shall then have a table 
like Table 66. 

This is a sampling distribution of the number or frequency of
successes per sample. We may find its standard deviation by
the familiar formula,

$$\epsilon_f = \sqrt{\frac{\sum f(X - M_x)^2}{N}},$$

where X is the number of successes per sample, $M_x$ is the mean
number of successes per sample, f is the number of samples having
a given number of successes, and N is the total number of
samples. Since this is the standard deviation of the number of
successes from many actual samples, we may call it an empirical
standard error, to distinguish it from the standard deviation of a
series where the question of sampling does not enter. We may
further call the standard error of this formula the empirical
standard error of the number of successes per sample, to differentiate
it from the standard error of, say, a mean or correlation
coefficient.

An empirical standard error like the above has the disadvantage,
however, that it is itself a sample value that is affected
by the number of samples taken, and varies because of random
errors of sampling. Mathematicians are able to calculate a more
exact or theoretical standard error,¹ provided they are allowed to
specify the nature of the distribution of the universe values and
the conditions under which the sample is taken. This enables
them to lay down requirements which ensure that the parameter,
say, a frequency, will be distributed in the samples according to
some established mathematical principle, such as the binomial
theorem or the normal curve.

1 Or probable error, if preferred. From Chap. IX, pp. 160 and 161, it will
be recalled that the probable error is related to the standard error by the
equation P.E. = .6745σ, where σ = ε in our subsequent notation.

TABLE 66.—DISTRIBUTION OF BROKEN HOMES PER SAMPLE OF 50 JUVENILE
DELINQUENTS IN 100 RANDOM SAMPLES

Broken Homes
 per Sample        Samples
      0               0
      1               0
      2               0
    . . .           . . .
      8               1
      9               1
     10               2
     11               5
     12               9
     13              10
     14              12
     15              12
     16              11
     17               8
     18               9
     19               5
     20               6
     21               4
     22               3
     23               1
     24               0
     25               1
     26               0
     27               0
    . . .           . . .
     48               0
     49               0
     50               0

   Total            100


Some of the commonest of the standard error formulas that
are applied in the sampling of attributes assume that the sampling
distribution is binomial in type. It will be recalled¹ that N times
the binomial expansion shows how many of N random samples of
n events each may be expected to have given numbers of successes
from 0 to n, where the probability of success remains the
same from event to event. If it can be shown that these requirements
were at least approximately complied with in drawing the
events of a sample, we may assume that the sampling distribution
will be approximately binomial in form. The standard deviation
of the binomial is well known, and is then the theoretical standard
error that we are seeking:

$$\epsilon_f = \sqrt{npq},$$

where p is the constant chance of success in the binomial universe,
q = 1 − p, n is the number of events in a single sample, and
f is the frequency. We have only to substitute in this formula to
get the standard deviation (called standard error) of the sampling
distribution.

1 See Chap. IX.
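The comparison between the empirical and the theoretical standard error can be sketched numerically. The block below is only an illustration, not part of the original treatment: it takes the frequencies of Table 66 as read here (the count of 3 for 22 broken homes is an assumption, chosen because it makes the total 100), computes the empirical standard error from the formula above, and sets it beside the binomial value, with p estimated from the mean number of successes. The variable names are ours.

```python
import math

# Table 66: broken homes per sample -> number of samples (zero rows omitted).
# Assumption: the entry for 22 is taken as 3, the value that makes the total 100.
table_66 = {8: 1, 9: 1, 10: 2, 11: 5, 12: 9, 13: 10, 14: 12, 15: 12,
            16: 11, 17: 8, 18: 9, 19: 5, 20: 6, 21: 4, 22: 3, 23: 1, 25: 1}
n = 50  # events (delinquents) in each sample

N = sum(table_66.values())                          # number of samples
mean = sum(x * f for x, f in table_66.items()) / N  # mean successes per sample

# Empirical standard error: standard deviation of the sampling distribution.
empirical_se = math.sqrt(
    sum(f * (x - mean) ** 2 for x, f in table_66.items()) / N
)

# Theoretical (binomial) standard error, with p estimated from the samples.
p = mean / n
theoretical_se = math.sqrt(n * p * (1 - p))

print(N, round(mean, 2), round(empirical_se, 2), round(theoretical_se, 2))
```

Under these assumptions the two values come out close to each other, which is what we should expect if the samples were drawn under approximately binomial conditions.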

In the investigation of sociological attributes, however, there
is usually available only one sample, rather than a distribution 
of many samples. In that case, if the sample was taken under 
binomial conditions, and its size is large, the best estimate of p, 
the proportion of successes in the universe, is the proportion of 
successes in the single sample. This estimate of p is then used 
in the above formula to compute an approximate theoretical 
sampling error. 

It will be noticed that the binomial theorem merely repeats 
the requirements of the simple sampling of attributes, which we 
have seen can be met only if an existent universe is infinite and 
well mixed, or a hypothetical universe is homogeneous. Because 
of the difficulties in taking a simple sample under many conditions 
in sociological research, it is fortunate that the standard errors 
of simple samples are usually not very different from those of 
random samples, and in any case are somewhat larger. For 
these reasons, investigators often apply simple sampling errors 
to random or even stratified samples, in order to save labor or 
to be on the conservative side when in doubt as to what the 
error formula should be. 

4. Only Large Samples Considered.—In sociological inves- 
tigations, many factors that the sociologist is unable to control 
usually cause small samples to differ radically from one another. 
Small samples are, therefore, not often used in social research.
For this practical reason and for the sake of simplicity, the
discussion in this book is limited to large samples. As a rule, the
standard error formulas given may become rather seriously
inaccurate if applied to samples with fewer than 20 to 25 items,
and are safest when used with much larger samples.

5. Standard Error Formulas. a. The Standard Error of a
Frequency.—As just shown, for a simple sample, the formula
for the standard error of a frequency is

$$\epsilon_f = \sqrt{npq} = \sqrt{f\left(1 - \frac{f}{n}\right)}, \qquad (113)$$

where p is the constant probability of drawing a success at any
single draw, q = 1 − p, and f is the frequency in question.
p refers to the probability in the universe, i.e., to the true or
expected probability, and f to the true frequency, but they are
usually estimated from the sample when the latter is large.

We shall illustrate the use of this formula by application 
to the age distribution of an approximately simple sample of 
unemployed in New York City in 1930, shown in Table 67. 


TABLE 67.—THE DISTRIBUTION OF UNEMPLOYED PERSONS BY AGE, IN A
SIMPLE SAMPLE OF 100 UNEMPLOYED IN NEW YORK CITY, 1930

Age, years        f       d       fd      fd²
15-24            28      −2      −56      112
25-34            23      −1      −23       23
35-44            21       0        0        0
45-54            16       1       16       16
55-64             9       2       18       36
65 and over       3       3        9       27

Total           100              −36      214


What is the range of error in the sample estimate of the relative
number of unemployed in the age class 15-24? If we assume
that the universe of the unemployed in New York is existent
and large enough to be regarded as infinite for our purposes, and
that it does not change appreciably during the process of sampling,
then the probability of drawing an unemployed person
in the age class 15-24 should be constant from draw to draw,
and from one sample of 100 persons to another. Thus the
requirements of simple sampling are met, and we may determine
the error of sampling of the frequency by formula (113). Since
n is as large as 100, we accept 28 as an estimate of the true
frequency in the age class 15-24. Substituting in formula (113),

$$\epsilon_{28} = \sqrt{28\left(1 - \tfrac{28}{100}\right)},$$

$$\epsilon_{28} = 4.5.$$

1 See Sec. 6.


In a large number of such samples the frequency in the age
class 15-24 is approximately normally distributed, and the
standard error just found is an estimate of the standard deviation
of that normal distribution. Under these conditions, about two
times in three the true frequency in the age class 15-24 will be
included within one standard error above and below the sample
frequency of 28. That is, about two times out of three, we
should expect the true number of unemployed persons in the age
class 15-24 to be contained between the limits 28 ± 4.5, or
between 23.5 and 32.5. If we want more security than this, we
may multiply the standard error by two, getting limits of
28 ± 2(4.5), or 19 to 37, within which about 19 times in 20 the
true frequency will be found.¹ To attain practical certainty, we
may multiply the error by three, giving chances of about 369 to 1
that the true frequency is enclosed between 28 ± 3(4.5), or
between 14.5 and 41.5. Usually, a range of twice the standard
error is regarded as safe enough. In case this range, here
19 to 37, seems too wide to be of much value, and we want
to narrow it, the size of the sample must necessarily be increased,
since the size of the sampling error, relative to the frequency
estimated, varies inversely as √n (see Sec. 6).

FIG. 62.—Showing a range within which the true number of unemployed in
the age class 15-24 years will be enclosed about 19 times out of 20 (in the long
run), as determined from a simple sample of 100 unemployed in New York
City, 1930.

1 See Appendix Table 1.
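The arithmetic of formula (113) and of the ranges of one, two, and three standard errors can be checked with a few lines of code. This is a minimal sketch under the assumptions of the text (a simple sample of 100 with a frequency of 28 in the age class 15-24); the function name is ours and is not part of any library.

```python
import math

def frequency_standard_error(f, n):
    """Formula (113): standard error of a frequency f in a simple sample of n."""
    return math.sqrt(f * (1 - f / n))

f, n = 28, 100
se = frequency_standard_error(f, n)

# Ranges of one, two, and three standard errors around the sample frequency.
ranges = {k: (f - k * se, f + k * se) for k in (1, 2, 3)}
for k, (low, high) in ranges.items():
    print(f"{k} standard error(s): {low:.1f} to {high:.1f}")
```

The unrounded standard error is a little under 4.5, so the limits printed agree with those in the text to one decimal place.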

Evidently, if the size of the sample is decreased, the relative
range of the sampling error increases, so that one reason why a
small sample is not suitable for estimating the value of a parameter
is easily seen. For example, suppose that the number of
persons in the sample of Table 67 is only 10, and the frequency
in the age class 15-24 is 3. Substituting in formula (113), with
a factor n/(n − 1) = 10/9 inserted as a correction for the small
size of the sample,

$$\epsilon_3 = \sqrt{\tfrac{10}{9}\left[3\left(1 - \tfrac{3}{10}\right)\right]},$$

$$\epsilon_3 = 1.53.$$

We no longer have confidence in the sample frequency as an
estimate of the universe frequency for use in the formula, but
disregarding this, we find the range of twice the standard error
to be 3 ± 2(1.53) = −0.06 to 6.06, or approximately 0 to 6.
The ratio of twice the error to the frequency is now 3.06/3 = 1.02,
as compared with 9/28 = 0.32 for the larger sample.

If it is known that a sample was taken under Poisson conditions
from a stratified universe, the standard error of a frequency
estimated from it may be obtained by the formula

$$\epsilon_f^2 = npq - n\sigma_p^2, \qquad (114)$$

where $p_j$ is the proportion of successes in any stratum, j, of the
universe; p is the mean of the $p_j$'s; $\sigma_p^2$ is the variance of the
$p_j$'s:

$$\sigma_p^2 = \frac{\sum_{j=1}^{k} n_j p_j^2}{n} - p^2; \qquad (115)$$

$n_j$ is the number of events in stratum j; and k is the number of
the strata. As in the case of formula (113), if the universe values
of these statistics are not known, they are commonly estimated
from the sample, provided the frequency in each stratum of the
sample is fairly large (say, 50 or more).
TABLE 68.—ONE HUNDRED UNEMPLOYED PERSONS IN NEW YORK CITY,
1930, CLASSIFIED BY COLOR AND NATIVITY

                         Number of unemployed
                            Native     Foreign-born
Age, years       Total      white         white        Negro
                             (1)           (2)          (3)
15-24              28          5            18            5
25-34              23          7            13            3
35-44              21          9             7            5
45-54              16         12             1            3
55-64               9          6             0            3
65 and over         3          3             0            0

Total             100         42            39           19

Assume that Table 68 above is a Poisson sample, drawn as
previously described, the strata being native white (j = 1),
foreign-born white (j = 2), and Negro (j = 3), as shown in the
table. Let $n_1 = 42$, $n_2 = 39$, and $n_3 = 19$; and let the numbers
in each stratum falling in the age class 15-24 be $f_1 = 5$, $f_2 = 18$,
and $f_3 = 5$, giving $p_1 = 5/42 = 0.12$, $p_2 = 18/39 = 0.46$, and
$p_3 = 5/19 = 0.26$.

Then p = 28/100 = 0.28, q = 1 − p = 1 − 0.28 = 0.72, and from
formula (115)

$$\sigma_p^2 = [42(.12)^2 + 39(.46)^2 + 19(.26)^2]/100 - (.28)^2 = 0.023.$$

Substituting in formula (114),

$$\epsilon_f^2 = 100(.28)(.72) - 100(.023),$$
$$\epsilon_f^2 = 17.86,$$
$$\epsilon_f = 4.23.$$

Notice that this error is slightly smaller than that found on the
assumption that Table 67 represented a simple sample.
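Formulas (114) and (115) lend themselves to a short computational sketch. The function below is our own illustration (its name and arguments are not from the text); it takes the stratum sizes and the number of successes in each stratum and returns the standard error of the total frequency under Poisson sampling. Because it works with unrounded proportions, its result differs slightly in the second decimal from the value obtained above with proportions rounded to two places.

```python
import math

def poisson_frequency_se(stratum_sizes, stratum_successes):
    """Standard error of a frequency from a stratified (Poisson) sample."""
    n = sum(stratum_sizes)
    p = sum(stratum_successes) / n  # weighted mean of the stratum proportions
    # Formula (115): variance of the stratum proportions, weighted by n_j.
    var_p = sum(n_j * (f_j / n_j) ** 2
                for n_j, f_j in zip(stratum_sizes, stratum_successes)) / n - p ** 2
    # Formula (114): simple-sample variance less the between-strata part.
    return math.sqrt(n * p * (1 - p) - n * var_p)

# Strata of Table 68: native white, foreign-born white, Negro (age class 15-24).
se = poisson_frequency_se([42, 39, 19], [5, 18, 5])
print(round(se, 2))
```

The subtraction in formula (114) shows why stratification helps: the more the strata differ in their proportions, the larger the term removed from the simple-sample variance.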

It may be objected that in Table 68 the frequencies in cols. (1), 
(2), and (3) are not large enough to yield very good estimates 
of the true values in the universe. 

b. The Standard Error of a Proportion.—In dealing with Table
67, above, as a simple sample we may think of the frequency 28
in the age class 15-24 as a proportion of the total frequency 100,
p = 28/100 = 0.28, and use formula (116) to find the standard
error of this proportion:

$$\epsilon_p = \sqrt{\frac{pq}{n}}. \qquad (116)^1$$

Substituting in (116),

$$\epsilon_{.28} = \sqrt{\frac{.28(1 - .28)}{100}},$$

$$\epsilon_{.28} = 0.045.$$

Therefore the proportion of unemployed persons in the age
class 15-24 estimated from the sample and the range of error
of the true proportion may be written 0.28 ± 0.045. This means
that the chances are two to one that the true proportion, or
parameter, is not less than 0.235 or more than 0.325.

If we suppose, as we did in the preceding section, that Table 68
is a Poisson sample, the formula for the standard error of a
proportion is

$$\epsilon_p^2 = \frac{pq}{n} - \frac{\sigma_p^2}{n}. \qquad (117)$$

Using the same values as for formula (114), we have,

$$\epsilon_{.28}^2 = \frac{(.28)(.72)}{100} - \frac{.023}{100},$$
$$\epsilon_{.28}^2 = 0.001786,$$
$$\epsilon_{.28} = 0.0423,$$

which is again smaller than the standard error of the same
proportion estimated from Table 67 regarded as a simple sample.

c. The Standard Error of an Arithmetic Mean.—Even when the
universe departs considerably from normality, the means of large 
samples tend themselves to be normally distributed. 

Formula (118) gives the standard error of the arithmetic mean
found from a simple sample:

$$\epsilon_M = \frac{\sigma}{\sqrt{N}}, \qquad (118)$$

where N is the total frequency of the table or the sample, and σ is
the standard deviation of the universe, estimated from the
sample.

1 Just as a frequency is changed to a proportional frequency by dividing
it by n, so the standard error of the former [formula (113)] is changed into
the standard error of the latter [formula (116)] in the same way:

$$\epsilon_p = \frac{1}{n}\sqrt{npq} = \sqrt{\frac{pq}{n}}.$$

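The mean of the simple sample of Table 67, taking the midpoint
of the open interval at 70, is

$$M = A + i\,\frac{\sum fd}{N},$$
$$M = 40 + \left(\frac{-36}{100}\right)(10) = 36.4.$$

The standard deviation is

$$\sigma = i\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = 10\sqrt{\frac{214}{100} - \left(\frac{-36}{100}\right)^2},$$

$$\sigma = 14.2.$$

Substituting these values in formula (118),

$$\epsilon_M = \frac{14.2}{\sqrt{100}},$$
$$\epsilon_M = 1.42.$$

We therefore write for the mean and its standard error
36.4 ± 1.42.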
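The short-method computation of the mean, the standard deviation, and the standard error of the mean may be sketched as follows. The frequencies and coded deviations are those of Table 67 as read here (frequencies summing to 100, Σfd = −36, Σfd² = 214); the variable names are ours.

```python
import math

# Table 67: class frequencies and coded deviations from the guessed mean.
f = [28, 23, 21, 16, 9, 3]   # age classes 15-24 up to the open class at the top
d = [-2, -1, 0, 1, 2, 3]     # deviations in class-interval units
A, i = 40, 10                # guessed mean (midpoint of 35-44) and class interval

N = sum(f)
sum_fd = sum(fj * dj for fj, dj in zip(f, d))
sum_fd2 = sum(fj * dj ** 2 for fj, dj in zip(f, d))

mean = A + i * sum_fd / N
sd = i * math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2)
se_mean = sd / math.sqrt(N)  # formula (118)

print(round(mean, 1), round(sd, 1), round(se_mean, 2))
```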


For a Poisson sample, the standard error of the mean is given
by the formula

$$\epsilon_M^2 = \frac{\sigma^2}{N} - \frac{\sigma_m^2}{N}, \qquad (119)$$

where $\sigma^2$ is the variance of the universe estimated from the total
sample, and

$$\sigma_m^2 = \frac{\sum_{j=1}^{k} n_j m_j^2}{N} - M^2, \qquad (120)$$

where $m_j$ is the mean of the jth stratum. As usual, all statistics
are estimated from the sample when the true values in the
universe are not known.

Referring to Table 68, we found that

$$\sigma^2 = (14.2)^2 = 201.64,$$

and $M^2 = (36.4)^2 = 1{,}324.96$. Let the mean age of the native
whites be $m_1 = 44$, of the foreign-born whites $m_2 = 27.7$, and
of the Negroes $m_3 = 38$. We compute

$$\sigma_m^2 = \frac{42(44)^2 + 39(27.7)^2 + 19(38)^2}{100} - 1{,}324.96,$$
$$\sigma_m^2 = 61.76.$$

Substituting in formula (119),

$$\epsilon_M^2 = \frac{201.64}{100} - \frac{61.76}{100} = 1.3988,$$
$$\epsilon_M = 1.18.$$

As before, we see that the standard error of the Poisson sample
is smaller than that of the simple sample.
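The same computation can be written as a small function. This is a sketch only, with names of our own choosing: it accepts the variance and mean of the whole sample, together with the size and mean of each stratum, and applies formulas (120) and (119) in turn.

```python
import math

def poisson_mean_se(total_variance, grand_mean, stratum_sizes, stratum_means):
    """Standard error of the mean of a stratified (Poisson) sample."""
    N = sum(stratum_sizes)
    # Formula (120): variance of the stratum means about the grand mean.
    var_m = sum(n_j * m_j ** 2
                for n_j, m_j in zip(stratum_sizes, stratum_means)) / N - grand_mean ** 2
    # Formula (119): simple-sample variance of the mean less the strata part.
    return math.sqrt(total_variance / N - var_m / N)

# Values used in the text: sigma = 14.2, M = 36.4, three strata of Table 68.
se = poisson_mean_se(14.2 ** 2, 36.4, [42, 39, 19], [44, 27.7, 38])
print(round(se, 2))
```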

The standard error of the mean is most useful in testing the 
significance of the difference between two means, to be treated 
later. 

d. Standard Error of the Standard Deviation.—For a simple
sample drawn from an approximately normally distributed
universe, the standard error of a standard deviation is

$$\epsilon_\sigma = \frac{\sigma}{\sqrt{2N}}, \qquad (121)$$

where σ is the standard deviation of the universe, estimated
from the sample.


TABLE 69.—SCORES OF 100 COMMUNITIES ON A COMMUNITY ORGANIZATION
TEST

Score (X)       Communities (f)
80-99                                   372.6
60-79                                   363.8
40-59                                    60.2
20-39                                   372.0
0-19
Total...

Table 67 is a J-shaped rather than a normal distribution, so
it does not lend itself to formula (121). We shall, however,
try applying the formula to Table 69, which is only moderately
skewed. The 100 communities were taken at random from the
total of some 300 cities of a given size class in the United States,
the name of each community taken being replaced before the
next draw. The sample may, therefore, be regarded as a simple
sample, representing an infinite existent universe of cities like
the 300 cities reported by the census.

The standard deviation of Table 69 is

$$\sigma = 20\sqrt{\frac{\sum fd^2}{100} - \left(\frac{\sum fd}{100}\right)^2} = 21.6.$$

So that, by formula (121),

$$\epsilon_\sigma = \frac{21.6}{\sqrt{(2)(100)}} = 1.53.$$

And we write for the standard deviation and its standard error

21.6 ± 1.53.
e. Standard Error of a Variance.—Assuming as before a simple
sample from an approximately normal universe, the variance,
$\sigma^2$, has the standard error,

$$\epsilon_{\sigma^2} = \sigma^2\sqrt{\frac{2}{N}}. \qquad (122)$$

The variance of Table 69 is $(21.6)^2 = 466.56$, and its standard
error is

$$\epsilon_{\sigma^2} = 466.56\sqrt{\frac{2}{100}} = 65.98.$$
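Formulas (121) and (122) are simple enough to verify directly. The sketch below assumes, as the text does, a simple sample of N = 100 from an approximately normal universe with an estimated standard deviation of 21.6; the function names are ours.

```python
import math

def se_of_standard_deviation(sigma, N):
    """Formula (121): standard error of a standard deviation."""
    return sigma / math.sqrt(2 * N)

def se_of_variance(sigma, N):
    """Formula (122): standard error of a variance."""
    return sigma ** 2 * math.sqrt(2 / N)

sigma, N = 21.6, 100
se_sd = se_of_standard_deviation(sigma, N)
se_var = se_of_variance(sigma, N)
print(round(se_sd, 2), round(se_var, 2))
```

Relative to the quantity estimated, the standard error of the variance (about 14 per cent here) is twice that of the standard deviation (about 7 per cent), as follows from the two formulas.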

f. Standard Errors of Sampling from a Limited Universe.—A
great part of the sampling done in social research is from limited
rather than from infinite universes. It has already been seen
that from a limited universe a random sample can be drawn, but
a simple sample cannot. All the formulas for finding the standard
errors of a simple sample given above, therefore, need a
correction if the sample is drawn at random from a limited
universe, i.e., if the sample is random but not simple. In the
case of a mean, frequency, or proportion, the correction consists
in applying the multiplying factor, $\sqrt{(U - s)/U}$, to the standard
error of a simple sample, where U is the number of events
in the limited universe, and s is the number of events in the
sample, so that s = n or N in our formulas above. It is not
certain that this correction is applicable to standard errors
other than those mentioned.

One or two illustrations will suffice to show how the correction
factor $\sqrt{(U - s)/U}$ is used. In the section dealing with the
standard error of a frequency for Table 67, we found $\epsilon_{28} = 4.5$.
Now if we regard the universe of the unemployed in New York
City from which this sample of 100 was drawn as a limited
universe, consisting in the year 1930 of an average of 300,000
persons, we have

$$\sqrt{\frac{U - s}{U}} = \sqrt{\frac{300{,}000 - 100}{300{,}000}} = .9998.$$

Multiplying this into the standard error found by assuming an
infinite universe we get .9998(4.5) = 4.499, which for all practical
purposes is the same as before. This suggests that when the
limited universe is quite large there is no need to make the
correction.

Suppose from a limited universe of 1,382 divorces granted in a
certain court in 1939, a random sample of 200 is drawn. From
this sample the mean legal cost of getting a divorce is found
to be $136, and the standard deviation $32. If we regard the
universe as limited, the standard error of $136 is obtained by
multiplying the corrective factor into formula (118), the standard
error of the mean of a sample from an infinite universe:

$$\epsilon_M = \sqrt{\frac{U - s}{U}}\cdot\frac{\sigma}{\sqrt{N}} = \sqrt{\frac{1{,}382 - 200}{1{,}382}}\left(\frac{32}{\sqrt{200}}\right) = .925(2.26) = 2.09.$$

In this case, the correction for a limited universe reduces the
standard error of the mean over 7.0 per cent.
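Both illustrations of the correction for a limited universe can be reproduced with a short sketch. The function name is ours; the figures are those given in the text.

```python
import math

def limited_universe_factor(U, s):
    """Correction factor sqrt((U - s) / U) for a random sample of s from U events."""
    return math.sqrt((U - s) / U)

# Unemployed in New York City: the universe is so large that the factor is nearly 1.
factor_large = limited_universe_factor(300_000, 100)

# Divorce costs: 200 cases drawn from 1,382, standard deviation $32.
factor_small = limited_universe_factor(1_382, 200)
se_infinite = 32 / math.sqrt(200)        # formula (118)
se_corrected = factor_small * se_infinite

print(round(factor_large, 4), round(factor_small, 3), round(se_corrected, 2))
```

The factor departs appreciably from 1 only when the sample is a sizable fraction of the universe, here about one-seventh.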

g. Standard Errors When the Unit of Sampling Is a Group of 
Events, or District; and the Standard Error of a Population Rate.— 
When a sample of districts, instead of individual events, is 
taken, the district simply replaces the individual event in the 
appropriate standard error formula. That is, n or N becomes 
the number of districts, rather than the number of events. 
The proper standard error formula to use in any given case 
depends as before on the conditions under which the sample was 
drawn. However, only those standard error formulas are 
appropriate for districts that apply to variables, because, disre- 
garding sampling errors within districts, each district is merely 
one value of a variable, such as a proportion or mean, determined
by the events within the district. In finding the mean, the
variance, and so on, of the district values, it is usually advisable
to weight the latter by the number of events in the respective
districts.


TABLE 70.—BIRTH RATES IN 20 COUNTIES OF WISCONSIN, 1935

County    Birth rate    Population,     Products       Squares
 (i)       = (p_i)     1930 = (Y_i)    = (Y_i p_i)   = (Y_i p_i²)
  1         .0186          8,003          148.856       2.768718
  2         .0222         21,054          467.399      10.376253
  3         .0184         34,301          631.138      11.612947
  4         .0125         15,006          187.575       2.344688
  5         .0221         70,249        1,552.503      34.310314
  6         .0175         15,330          268.275       4.694813
  7         .0172         10,233          176.008       3.027331
  8         .0157         16,848          264.514       4.152864
  9         .0205         37,342          765.511      15.692976
 10         .0173         34,165          591.055      10.225243
 11         .0174         30,503          530.752       9.235088
 12         .0225         16,781          377.573       8.495381
 13         .0171        112,737        1,927.803      32.965426
 14         .0144         52,092          750.125      10.801797
 15         .0208         18,182          378.186       7.866260
 16         .0162         46,583          754.645      12.225243
 17         .0187         27,037          505.592       9.454569
 18         .0220         41,087          903.914      19.886108
 19         .0178          3,768           67.070       1.193853
 20 = n     .0173         59,883        1,035.976      17.922383

 Σ                       671,184       12,284.470     229.252255


In Table 70 is a random sample of 20 counties of Wisconsin,
showing their birth rates ($p_i = X_i/Y_i$, where $X_i$ = births) in
1935. The mean birth rate for the table is

$$\bar{p} = \frac{\sum_{i=1}^{n} Y_i p_i}{\sum_{i=1}^{n} Y_i} = \frac{12{,}284.470}{671{,}184} = .01830, \qquad (123)$$

and the variance is

$$\sigma_p^2 = \frac{\sum_{i=1}^{n} Y_i p_i^2}{\sum_{i=1}^{n} Y_i} - \left(\frac{\sum_{i=1}^{n} Y_i p_i}{\sum_{i=1}^{n} Y_i}\right)^2 = \frac{229.252255}{671{,}184} - (.0183)^2 = .00000667, \qquad (124)$$

so that $\sigma_p = \sqrt{.00000667} = .0025826$.

By formula (118),

$$\epsilon_{\bar{p}} = \frac{.0025826}{\sqrt{20}} = .0005775.$$

Or, combining formulas (124) and (118), and adding a term
due to errors of sampling within a district, the standard error
of a population rate is approximately

$$\epsilon_{\bar{p}}^2 = \frac{1}{n}\left[\frac{\sum Y_i p_i^2}{\sum Y_i} - \left(\frac{\sum Y_i p_i}{\sum Y_i}\right)^2\right] + \frac{\bar{p}\,\bar{q}}{\sum Y_i}. \qquad (125)^1$$

Thus, for Table 70,

$$\epsilon_{\bar{p}}^2 = (.0005775)^2 + \frac{.0183(.9817)}{671{,}184} = .0000003335 + \frac{.017965110}{671{,}184}$$
$$= .0000003335 + .000000026766 = .000000360266,$$
$$\epsilon_{\bar{p}} = 0.0006.$$

Since we think of the 71 counties of Wisconsin in 1935 as a
limited universe of birth rates, we should apply the correction²
factor, $\sqrt{(71 - 20)/71} = \sqrt{.7183}$, giving

.0006(.8475) = 0.000509

as the final standard error. We therefore write the mean birth
rate and its standard error: 0.0183 ± 0.00051, or multiplied by
1,000, 18.30 ± 0.51. The chances are 19 in 20 that the birth
rate per 1,000 for the state as a whole will be enclosed within
the sample range 18.30 ± 2(.51) = 17.28 to 19.32. As a matter
of fact, the birth rate for Wisconsin in 1935 was 17.3, almost at
the bottom limit. This is because the city of Milwaukee, with
a very low birth rate of 15.0, happened to be left out of the sample.
The birth rate for the state without Milwaukee was 18.09, which
is well within the estimated range of sampling error.

1 More exactly, the last term is

$$\frac{1}{\sum Y_i}\left(\bar{p} - \frac{\sum Y_i p_i^2}{\sum Y_i}\right) = \frac{1}{671{,}184}\left(.0183 - \frac{229.252255}{671{,}184}\right) = 0.000000026756.$$

When the population is large, however, this term is usually negligible.

2 Or, using population weights, the correction factor is

$$\sqrt{\frac{\sum^{N} Y_i - \sum^{n} Y_i}{\sum^{N} Y_i}},$$

where N is the number of districts in the universe and n is the number of
districts in the sample.
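The whole computation for Table 70 can be sketched in code, working from the birth rates and populations alone. The data are those of the table as read here; the population of county 18 is taken as 41,087, the value implied by its products and squares entries and by the column total, which is an assumption where the source is unclear. Variable names are ours. Since the code carries unrounded values throughout, the intermediate figures differ slightly from those in the text, but the final standard error agrees to the precision quoted.

```python
import math

# Table 70: birth rates p_i and 1930 populations Y_i of 20 Wisconsin counties.
# Assumption: county 18 has population 41,087 (consistent with the column total).
rates = [.0186, .0222, .0184, .0125, .0221, .0175, .0172, .0157, .0205, .0173,
         .0174, .0225, .0171, .0144, .0208, .0162, .0187, .0220, .0178, .0173]
pops = [8003, 21054, 34301, 15006, 70249, 15330, 10233, 16848, 37342, 34165,
        30503, 16781, 112737, 52092, 18182, 46583, 27037, 41087, 3768, 59883]
n, counties_in_state = len(rates), 71

total_pop = sum(pops)
p_bar = sum(y * p for y, p in zip(pops, rates)) / total_pop                # (123)
var_p = sum(y * p * p for y, p in zip(pops, rates)) / total_pop - p_bar ** 2  # (124)

between = var_p / n                        # sampling of districts
within = p_bar * (1 - p_bar) / total_pop   # sampling within districts (approximate)
se = math.sqrt(between + within)           # formula (125)

# Correction for a limited universe of 71 counties.
se_corrected = se * math.sqrt((counties_in_state - n) / counties_in_state)
print(round(p_bar * 1000, 2), round(se_corrected * 1000, 2))
```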

This case illustrates one of the dangers of sampling by dis- 
tricts, namely, that the sample may omit a district with an 
extreme value and a very large number of events. This is avoided 
when the events are sampled directly. In the case of birth 
rates and other population rates, sampling by districts is unavoid- 
able, but counties like Milwaukee should be subdivided into 
several average-sized population districts, each with the given 
Milwaukee birth rate. Then the chance of such an omission 
from the sample is lessened. 

It was assumed above that all the events in each sample 
district were used to determine the district value. Sometimes 
it is necessary to sample the events in sample districts. This 
might be the case if we wanted to study a few hundred farmers’ 
household accounts in a given state. We would probably draw 
a sample of counties, but could not get accounts from all of the 
farmers in a sample county. This random sampling of events 
would increase the sampling error within the districts. For 
example, in formulas (123), (124), and (125), $Y_i$ would become $y_i$,
where $y_i$ is the size of the sample population drawn from district
i, on which the birth rate, $p_i$, is calculated.

6. Control of Sampling Error by Size of Sample.—The number
of items that a sample should contain to yield a satisfactorily
accurate picture of the distribution in the universe from which
it is taken depends on the number of different kinds or classes of
items that it is necessary to distinguish (i.e., on the heterogeneity)
in the universe, on the relative frequencies of the items in each
class, and on whether the items are stratified or mixed. This may
be explained by a simple illustration.

If the universe is limited and consists of two individuals, a
white and a Negro, who are to be examined for skin color,
evidently the sample will fall short of giving a proper picture
of the universe, or of being representative, unless it contains
both of the individuals, or all of the items in the universe.
Should the universe contain thousands of individuals but only
two skin colors, each equally distributed among the population
and subject to no variation in shade from one individual to
another, a perfectly representative sample need still contain
only one individual of a color, each drawn at random from a
color group or stratum. If the same universe is not stratified,
however, but the sample has to be drawn at random from the two
races mixed together, a sample of more than two individuals
should be taken, since otherwise the chance of getting both
individuals of one given color is one in four, and of getting both
of the same color, whichever it may be, one in two
$[(\tfrac{1}{2} + \tfrac{1}{2})^2 = \tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{4}]$.
In fact, a fairly large sample, say not less than 25 items, is
now advisable, to lessen the risk that one of the colors will
appear much more often than the other, and so give a false
impression of its relative frequency in the universe. Finally,
suppose that the universe includes many individuals of the same
race, say Negro, but the skin color varies widely among the
individuals. Suppose, further, that we want to learn from a
sample the relative frequency of occurrence of the various shades
of skin color, including the extreme shades that the color takes.
If some shade, say intensely black, exists in only one individual
per 1,000 in the universe, a random sample containing even
as many as 100 individuals will fail to include it nine times in
10 $[(1.000 - 0.001)^{100} \approx 0.90]$.
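The two probabilities in this illustration follow from the binomial expansion and can be checked in a few lines. This is merely an arithmetic sketch; the variable names are ours.

```python
# Two equally frequent colors, sample of two drawn at random from the mixture.
p_both_one_given_color = 0.5 ** 2             # e.g., both white
p_both_same_color = 0.5 ** 2 + 0.5 ** 2       # both white or both Negro

# A shade present in 1 individual per 1,000: chance that a random sample
# of 100 fails to include it at all.
p_rare_shade_missed = (1 - 0.001) ** 100

print(p_both_one_given_color, p_both_same_color, round(p_rare_shade_missed, 3))
```

The last figure, about nine in ten, shows how large a sample must be before rare classes in the universe can be expected to appear in it.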

If we want to use the sample merely to estimate the mean
of the universe distribution, omissions at one part of the scale
may cancel omissions at another part, so that the size of the
sample need not be so great. Yet, for a given degree of accuracy
in the estimate, the size of the sample must be increased as the
variance, $\sigma^2$, of the universe distribution, also estimated from the
sample, increases.

It is theoretically a simple matter to reduce the standard
error to any desired value by merely increasing the size of the
sample, N. For this purpose, we have the formula

$$N_2 = a^2 N_1, \qquad (126)$$

where

$$a = \frac{\epsilon_1}{\epsilon_2}, \qquad (127)$$

$\epsilon_1$ is the value of the original standard error, $\epsilon_2$ is the value of the
desired standard error, $N_1$ is the size of the original sample, and
$N_2$ is the size of sample needed to reduce $\epsilon_1$ to $\epsilon_2$.

In Sec. 5c, above, we found the mean, 36.4, and its standard
error, 1.42, from a simple sample of 100 items. What size of
sample is required to reduce the standard error to one-half its
present value? According to this requirement, $\epsilon_2 = \epsilon_1/2$, so
that a = 2. Substituting in formula (126),

$$N_2 = (2)^2(100) = 400.$$

Notice that when the divisor of the original standard error is
named, we have only to multiply the size of the sample by the
square of that divisor. That is, to divide the standard error by
two, we multiply the size of the sample by four. This rule
applies to any common standard error except the standard
error of a frequency. The easiest way to deal with the standard
error of a frequency is to substitute for it the standard error
of the equivalent proportion, for which the above rule holds.
For example, in Sec. 5b, above, the frequency, 28, in Table 67
was changed to the proportion, 0.28, for which the standard
error was estimated to be 0.045. To reduce this error, or the
relative error of the corresponding frequency, to one-third of
its value, we multiply $N_1 = 100$ by $(3)^2 = 9$, giving 900 as the
size of the sample required.
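Formulas (126) and (127) amount to a one-line rule, which may be sketched as a function. The name and arguments are ours; the two calls repeat the examples of the text.

```python
def required_sample_size(n1, se_original, se_desired):
    """Formulas (126) and (127): sample size needed to reach a desired standard error."""
    a = se_original / se_desired   # formula (127)
    return a ** 2 * n1             # formula (126)

# Halving the standard error of the mean (1.42 -> 0.71) from a sample of 100.
n_for_mean = required_sample_size(100, 1.42, 0.71)

# Reducing the standard error of the proportion 0.28 to one-third (0.045 -> 0.015).
n_for_proportion = required_sample_size(100, 0.045, 0.015)

print(round(n_for_mean), round(n_for_proportion))
```

Because the required size grows as the square of the divisor, each further gain in precision costs proportionally more in the number of items to be collected.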

The problem of determining the proper size of fairly large
samples may be approached in terms of confidence or fiducial
limits. That is, we may require the sample to be of such a size
that about 100(2P) times in 100 the value of a parameter in which
we are interested will be enclosed within a specified range. Using
again the example of Sec. 5c, let us say that we want to take a
sample of such size that the chances are about 95 in 100 that the
parameter will be enclosed within a range that extends on either
side of S a distance equal to 10 per cent of the value of S. We
then require

$$\epsilon_S\left(\frac{x}{\sigma}\right) = p'S, \qquad (128)$$

where, for this particular problem, S = M = 36.4,
$\epsilon_M = \sigma/\sqrt{N_2} = 14.2/\sqrt{N_2}$,
x is a mean deviate of S, p' is one-half the width of the range
expressed as a percentage of the value of S, x/σ is the value read
from a table of normal areas corresponding to one-half of the area
enclosed by the specified confidence limits, P = .95/2 = 0.475,
and $N_2$ is the required size of the sample. Assuming that the
distribution of sample means from large samples is approximately
normal in form, we turn to Appendix Table 1, and find that
2P = 2(.475) = 0.95 between the points x/σ = ±1.96, so that
x/σ = 1.96. Substituting in formula (128), we have

$$\frac{14.2}{\sqrt{N_2}}(1.96) = .10(36.4),$$
$$\sqrt{N_2} = 7.64,$$
$$N_2 = 58.4.$$

To check this, we write

$$S \pm \epsilon_S\left(\frac{x}{\sigma}\right), \qquad (129)$$
$$36.4 \pm \frac{14.2}{\sqrt{58.4}}(1.96),$$
$$36.4 \pm 3.64.$$

Since 3.64 is 10 per cent of 36.4, we have the result desired.
Notice, as a further check, that in our solution the standard
error of the mean is only $14.2/\sqrt{58.4} = 1.86$. If the mean age
varies by 10 per cent of its value, however, it will vary by ±3.64
years, which is $3.64/1.86 = 1.96\epsilon_M$. But ordinates of the normal
curve at the points ±1.96 standard errors include 95 per cent of
the area of the curve. Therefore, the chances are about 95 in 100
that the true mean will be found within ±10 per cent of the value
of the mean of our sample.

FIG. 53.—Showing 95 per cent confidence limits for the mean of a random
sample of 58 items. About ninety-five chances in a hundred the true mean will
be enclosed within the limits 32.76 and 40.04.
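The solution of formula (128) for the size of the sample may be sketched as follows. In place of Appendix Table 1 the sketch uses the inverse normal function from Python's standard library; that substitution, and the names of the function and its arguments, are ours. The normal deviate obtained is 1.96 to two decimals, as in the table.

```python
import math
from statistics import NormalDist

def sample_size_for_limits(sigma, statistic, half_width_fraction, confidence):
    """Solve formula (128) for N: (sigma / sqrt(N)) * (x/sigma) = p' * S."""
    # Normal deviate enclosing the central `confidence` share of the area.
    deviate = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (deviate * sigma / (half_width_fraction * statistic)) ** 2

# Example of Sec. 5c: sigma = 14.2, mean 36.4, limits of +/- 10 per cent, 95 in 100.
n2 = sample_size_for_limits(14.2, 36.4, 0.10, 0.95)
se_mean = 14.2 / math.sqrt(n2)
print(round(n2, 1), round(se_mean, 2))
```

In practice the result would be rounded up to the next whole item, since a fraction of an item cannot be sampled.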
7. Error in Mean vs. Individual Predictions from a Regression
Equation.—Interest often centers in predicting averages rather
than individual values from a regression equation.¹ Thus, out
of 20 counties with birth rates of 18 per 1,000, how far does the
mean observed death rate differ from the most probable death
rate predicted by the regression equation? The scatter of the
observed means of such samples around the predicted value
in this case depends upon the size of the sample, N, as well as
upon the value of r, and may be found from the equation

$$\epsilon_{M_y} = \frac{S_y}{\sqrt{N}} = \frac{\sigma_y\sqrt{1 - r^2}}{\sqrt{N}}. \qquad (130)$$

It appears from this that the scatter in predicting the mean
value of Y corresponding to a given value of X, or to the midpoint
of an X class interval, is reduced, compared to the scatter
in calculating any individual value of Y, in proportion to the
square root of the size of the sample from which the mean is
found. For the data of Table 50 of Chap. X, if we take a
sample of 20 counties all with approximately the same birth
rate, equation (130) gives

$$\epsilon_{M_y} = \frac{1.86}{\sqrt{20}} = 0.416,$$

which is less than a quarter of the size of the standard error of
estimate, $S_y$ (= 1.86), that governs the prediction of a death
rate from a birth rate in the case of a single county.

¹ See Chap. X, Tables 50 and 51.

8. Representativeness of a Sample.—The test of the goodness
of a sample is simply the test of its representativeness. If we
knew the value of the parameter, we could measure the
representativeness of the sample in terms of the percentage deviation
of the statistic from the parameter. Thus, if s is the statistic
and S is the corresponding parameter, the formula for measuring
the representativeness, $R_p$, is

$$R_p = \left[100 - 100\,\frac{s - S}{S}\right] \text{ per cent}, \tag{131}$$

where we take S - s if S > s.

The value of the parameter is seldom known, however, for if
it were, it is not likely that a sample would be taken. This
means that there is generally no direct way of measuring the
representativeness of a sample. The nearest approximation
would be to take several additional samples, each equivalent in
method of drawing and (preferably) in size to the original sample.
Then, in addition to noting the variation of certain statistics
from one of these samples to another, we might pool the samples
to obtain an average statistic, or average the statistics found from
them, and substitute this average value for S in formula (131),
above. But if this were done, we would of course at once
abandon the statistic from the original sample in favor of the
average statistic from the several samples, whose representativeness
would still be unknown. As a rule, therefore, the best that
we can do in the way of formulating an index of representativeness
is to rely on a large sample, and, where possible, stratification
of the universe, and measure the probable maximum deviation
of the statistic from the parameter in terms of, say, two standard
errors of the statistic ($\epsilon$). This permits us to say that the
probable minimum representativeness, $R_p$, of the statistic is

$$R_p = \left[100 - 200\,\frac{\epsilon}{s}\right] \text{ per cent}. \tag{132}$$


In Sec. 5c, above, we found the mean age of a simple sample of
100 ages to be 36.4 years, and the standard error of this mean to
be 1.42 years. If we knew that the mean age in the universe
from which the sample was taken was 37.5 years, we would find
the representativeness of the sample by formula (131) to be

$$R_p = \left[100 - 100\,\frac{37.5 - 36.4}{37.5}\right] = 100 - 2.9 = 97.1 \text{ per cent}.$$

But if we did not know the parameter value, we would estimate
the probable minimum representativeness by formula (132),

$$R_p = \left[100 - 200\left(\frac{1.42}{36.4}\right)\right] = 100 - 7.8 = 92.2 \text{ per cent}.$$
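
The two indexes can be sketched as a pair of small functions. The
function names are illustrative only; the arithmetic follows formulas
(131) and (132) as applied in the example above.

```python
def representativeness(statistic, parameter):
    """Formula (131): 100 less the percentage deviation of s from S."""
    return 100 - 100 * abs(statistic - parameter) / parameter

def min_representativeness(statistic, std_error):
    """Formula (132): 100 less two standard errors, as a percentage of s."""
    return 100 - 200 * std_error / statistic

rp_known = representativeness(36.4, 37.5)       # parameter known
rp_min = min_representativeness(36.4, 1.42)     # parameter unknown

print(round(rp_known, 1), round(rp_min, 1))
```

Taking the absolute value of the deviation plays the part of the rule
"take S - s if S > s."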


An indirect but important method of judging the representativeness
of a sample makes use of the circumstance that although
the value of the attribute or variable for which we are sampling
is not known in the universe, other universe values may be 
known; and if the sample can be shown to be representative of the 
latter, it is likely to be representative of the former also. As an 
illustration of this, we may draw a random sample of families in 
an Alabama county for the purpose of determining by field visits 
the percentage whose annual income falls below a certain mini- 
mum level. After the sample is obtained, it may be compared 
with the figures of the latest Federal census for the given county in 
regard to median size of family, the proportions of families having 
different numbers of children under 10 years of age, the per- 
centage of families that do not own their home, and the median 
rental paid. If a reasonably close agreement is found between 
the sample and the census population in these respects, the 
sample may usually be regarded as satisfactory also for the study 
of incomes. 


Exercises 


1. Define in both time and space (1) an infinite universe, (2) a limited
universe, (3) a hypothetical universe, choosing in each case a universe 
of interest to social scientists. 

2. Give an example of a universe of social attributes, and define the 
event and the “success.” 

3. Illustrate a universe of a social variable. 

4. Draw a sample of events or values from an actual known social 
universe, so that the sample will be (1) random, (2) simple, (3) Poisson 
(stratified). 

5. Draw a random sample of districts from a known social universe 
of your own choosing. 

6. In Table 34 of Chap. VIII, what is the standard error of the 
frequency in the class X = 0? What does it mean? 

7. In Table 34 of Chap. VIII, what is the standard error of the pro- 
portion of prisoners with no previous arrests? How does this standard 
error compare with that for a frequency found in Exercise 6 above? 

8. Ten thousand marriage certificates issued in the same month in 
five large American cities are taken as the universe, and a random 
sample of 500 certificates is drawn from them. After one year, it is 
found that 78 of the 500 marriages are divorced. What is the mean 
probability of divorce in this heterogeneous universe of marriages, 
and what is its approximate standard error? 

9. Judging from the sample in the following table, what is a range
within which the true number of Orientals immigrating to the United
States in the year 1929, expressed as a percentage of the total Oriental
immigration over the five-year period, 1925-1929, will fall 95 times out
of 100? Compare the standard errors found on the assumptions of a
simple sample from an infinite universe, a random sample from a
limited universe (total Chinese immigrants, 5,090, total Japanese
immigrants, 2,314), and a Poisson sample from a limited universe. If
the sampling was random and proportional between Chinese and
Japanese, which of these assumptions seems preferable, and why?

SAMPLE OF 740 CHINESE AND JAPANESE IMMIGRANTS TO THE UNITED STATES,
BY YEAR OF ARRIVAL

Year             Chinese   Japanese   Total
1929               102        65       167
1928               115        49       164
1927               105        43       148
1925 and 1926      187        74       261
Total              509       231       740

10. What is the standard error of the mean in the table of Exercise 3 
of Chap. XIII for urban families? How do you interpret it? 

11. Below are given the number of children under 5 years of age and
the number of women aged 15-45 years for each of 20 random counties
in Wisconsin in 1930, with the resulting fertility ratios.

Within what range will the fertility ratio for the state of Wisconsin
fall, 95 times out of 100? (NOTE: the fertility ratio in the State of
Wisconsin as a whole in 1930 was about 0.41. There are 71 counties
in the state.)

FERTILITY RATIOS AND BASIC DATA, 20 RANDOM COUNTIES IN WISCONSIN, 1930*

County   Children under   Women 15-45
 code      5 = X_i          = Y_i       X_i/Y_i
   1          731           1,523         .48
   2        1,968           4,331         .45
   3        3,463           7,084         .49
   4        1,243           2,728         .46
   5        6,998          16,408         .43
   6        1,562           3,157         .49
   7          953           1,924         .50
   8        1,619           3,526         .46
   9        3,707           8,118         .46
  10        3,330           6,765         .49
  11        2,536           6,166         .41
  12        1,745           3,339         .52
  13       10,016          27,401         .37
  14        4,504          10,889         .41
  15        1,757           3,671         .48
  16        3,707          10,546         .35
  17        2,796           5,550         .50
  18        3,758           9,692         .39
  19          395             648         .61
  20        5,364          13,330         .40
Total      62,152         146,791         .42

* From Fifteenth Census of the United States, 1930, Population, Vol. III, Part 2, pp.
1314-1319.

12. Within what range will the standard deviation of the fertility 
ratios in the universe fall 95 times in 100, according to the random 
sample in the table of Exercise 11 above? Can the standard error of 
the standard deviation be applied to urban families in the table of 
Exercise 3 of Chap. VIII? Explain. 

13. What size sample of rural nonfarm families in the table of Exer- 
cise 3 of Chap. XIII is needed to reduce the standard error to one- 
half its value? 

14. In the table of Exercise 9 above, what size sample of Japanese is 
required to confine the true value of the proportion of immigrants in 
the year 1929 within 5 per cent of the observed value 99 times in 100 
(i.e., within 99 per cent confidence limits)?

15. Measure the probable minimum representativeness of the mean 
score in Table 69, above. 
CHAPTER XIII
THE SIGNIFICANCE OF DIFFERENCES 


1. The Meaning of Tests of Significance.—It has been seen 
that the value of a statistic estimated from a random sample usu- 
ally differs somewhat from the true value, or parameter, in the 
universe from which the sample is drawn. Similarly, the values 
of a statistic, such as the mean, yielded by two or more random 
samples from the same universe, will almost never be exactly 
the same, and may sometimes be quite far apart. Such varia- 
tions, however, are due merely to chance errors of sampling 
and imply no actual differences. On the other hand, samples 
from different universes yield statistics of different values which 
represent real differences in the parameters. It therefore becomes 
a matter of great importance in investigations based on sampling
to distinguish between real differences and accidental ones. 

If we could be certain that two or more samples were taken 
at random from the same universe or from different universes, 
there would, of course, be no problem. In most of the practical 
sampling work done in the social sciences, however, the investigator
cannot feel entirely confident that his samples are random,
and he knows so little about the universes from which they are
taken that he cannot say whether these universes are essentially
the same or different. For example, if we try to take random
samples of 500 persons each from the total population of a city
like Chicago, it will not be easy to insure that the selection will
be random, or even to guarantee that the persons drawn will all
belong to the population of Chicago. If the several samples are
not taken on the same day or even at the same hour, the populations
sampled may be radically different, because of the traffic in
and out of the city in the mornings and evenings, on week ends
and holidays. As a consequence of such uncertainties, an
investigator feels the need for some kind of test that will lend
additional security to any inferences that he may draw from
samples. The development of such tests, based on the mathematical
theory of probability, constitutes the major part of
present-day statistical method.

In Chap. XII, it was usually assumed that if the sampling was 
random or simple from a normally distributed universe, the 
statistic itself would be normally distributed over many samples. 
By the use of the standard error, it then became possible to 
estimate from the normal curve the probability that the param- 
eter would be enclosed within specified limits. It is now neces- 
sary only to extend these ideas to the differences between sta- 
tistics, and to direct attention to the common rule that if a 
difference as large as the one observed might occur by chance 
no oftener than five times in 100, it is regarded as a real 
difference. In that case, the difference is said to be significant. 

The differences that are tested by this method are of two general 
kinds. The first is the difference between the value of a statistic 
and the value of a known or hypothetical parameter. For 
example, can a group of mothers with a mean age of 27 years be a 
random sample from a universe of mothers whose mean age is 
24 years? Or, can a correlation coefficient, r = .34, be a random 
statistic from a universe in which there is no correlation, i.e.,
where r = 0? The second kind of difference that is frequently
dealt with is the difference between the statistics from two or 
more samples. Thus, can two groups of mothers, one with a 
mean age of 27 years and the other with a mean age of 31 years, 
be random samples from the same universe? If the test shows 
that the difference is significant, it is inferred that the answer 
to the above questions is negative, on the grounds that a negative 
answer is highly probable. If the test fails to show a significant 
difference, the sample is regarded as a random sample from a 
given universe, or two samples are regarded as random samples 
from the same universe, until tests applied to larger samples show 
the contrary. If it is not positively known that the samples are 
random, a nonsignificant test at least allows us to say that the 
observed differences are no greater than might occur with random 
samples. 

If a difference is defined as real when the probability of its 
occurring by chance is as low as five in 100, we are said to be 
using the 5 per cent level of significance. The fixing of this critical 
probability is arbitrary and a matter of convention. The 5 per 
cent level is rather widely used at present, but the 1 per cent 
level is preferred when there is need to be more conservative.
Reference to Appendix Table 1 will show that 5 per cent of the 
area of the normal curve lies beyond ordinates erected at about 
two standard errors on each side of the mean, while 1 per cent 
of the area falls beyond ordinates at plus and minus 2.58 standard 
errors (see Fig. 54). It was formerly the practice to insist that 
an observed difference must fall as far out as three standard 
errors. At that point the probability is only about 27 in 10,000 
that so large a difference might occur by chance in either direc- 
tion. This is too stringent for ordinary purposes, because it 
causes the investigator to withhold judgment in an unnecessarily 
large proportion of cases. 


FIG. 54.—Five per cent of the area of the normal curve taken at the positive end
only, and divided equally between the positive and negative ends.

Notice that since the 5 per cent level of significance, for
example, includes 2.5 per cent of the area of the normal curve
at each end of the X scale, it implies that the probability of
getting either a positive or a negative difference is sought. If
it is desired to find the probability of getting say a positive
difference only, the reading is limited to the positive end of the
scale (see Fig. 54).
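
The areas quoted for the 5 per cent, 1 per cent, and three-standard-error
levels can be checked without a printed table. The sketch below is an
illustration only; it obtains the normal tail areas from the
complementary error function in the Python standard library rather
than from Appendix Table 1.

```python
import math

def two_tailed_area(z):
    """Area of the normal curve beyond +z and beyond -z together."""
    return math.erfc(z / math.sqrt(2))

def one_tailed_area(z):
    """Area of the normal curve beyond +z only."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1.96, 2.58, 3.0):
    print(z, round(two_tailed_area(z), 4), round(one_tailed_area(z), 4))
```

The one-tailed area at any point is half the two-tailed area, which is
the distinction drawn in Fig. 54.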

2. The Significance of a Correlation Coefficient.—Suppose we
have a correlation coefficient r = .34 from a simple sample of
30 pairs of variates from normal universes, one variate being
scores on an I.Q. test and the other the scores of the same
individuals on a personality test. Is the value of the observed r here
so small that it might occur as a random error in a sample from a
universe in which there is no correlation?

To answer this question, we test the difference of the observed
value of r from zero. Appendix Table 4 has been designed to
provide a ready-made test of this sort in the case of the correlation
coefficient. From it the value of the coefficient that is
just significantly different from zero at the 5 per cent (or the
1 per cent) level of significance may be read off at once, and
compared with the observed value. The table is entered with
N - 2 degrees of freedom, which in this case are 30 - 2 = 28.
At the 5 per cent level we find that an r = .36 is just significant.
Since the value of our r (+.34) is slightly smaller than this, it
might occur by chance a little oftener than five times in 100.
If we are governed strictly by the 5 per cent criterion, therefore,
we cannot accept an r = .34 as significantly different
from zero.

For simple samples from a normal universe so large that N is
not covered by Appendix Table 4, formula (133) is convenient
to test the hypothesis that the observed value of r is not different
from zero.

$$\epsilon_r = \frac{1}{\sqrt{N}} \tag{133}$$

In the example above,

$$\epsilon_r = \frac{1}{\sqrt{30}} = 0.18.$$

The ratio of the statistic r to its standard error, called the
critical ratio (C.R.), is

$$\text{C.R.} = \frac{r}{\epsilon_r} = \frac{.34}{0.18} = 1.89.$$

Entering a table of normal areas (Appendix Table 1) with
C.R. = x/σ = 1.89, the probability is found to be about six
in 100 that a larger value of r than that observed might occur
because of random errors of sampling. Again we find that the
value r = .34 is not quite significantly different from zero. It
might have come from a universe in which there was no correlation
at all.
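
A short sketch of the large-sample test follows. It assumes that the
standard error of r under the hypothesis of no correlation is 1/√N,
which is the form the worked figures (√30, 0.18, 1.89) imply, and it
reads the normal area from the error function instead of Appendix
Table 1.

```python
import math

r = 0.34    # observed correlation coefficient
n = 30      # number of pairs

se_r = 1 / math.sqrt(n)                      # standard error of r when the true r is 0
cr = r / se_r                                # critical ratio
p_two_tailed = math.erfc(cr / math.sqrt(2))  # chance of so large an r in either direction

print(round(se_r, 2), round(cr, 2), round(p_two_tailed, 2))
```

Carried without rounding, the critical ratio is about 1.86 rather than
1.89 (the text divides by the rounded 0.18), but the probability is
still about six in 100 and the verdict is unchanged.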

3. The Significance of $g_1$ and $g_2$.—In a problem in Chap. IX
we found the measure of skewness of a certain distribution
to be $g_1$ = 1.17. The standard error of $g_1$ in large samples is
approximately

$$\epsilon_{g_1} = \sqrt{\frac{6}{N}}. \tag{134}$$

Substituting in formula (134),

$$\epsilon_{g_1} = \sqrt{\frac{6}{252}} = 0.154.$$

If now we divide the value of $g_1$ by its standard error, we get the
critical ratio 1.17/0.154 = 7.6. Since this is much more than
two standard errors, chance is ruled out, and we conclude that the
distribution could not reasonably be regarded as a random sample
from a normal universe.

Let us next test the value of the measure of kurtosis, $g_2$ = 0.86,
found for the same distribution in Chap. IX. The standard
error of $g_2$ for large samples is

$$\epsilon_{g_2} = \sqrt{\frac{24}{N}}. \tag{135}$$

For N = 252,

$$\epsilon_{g_2} = \sqrt{\frac{24}{252}} = 0.309.$$

The critical ratio is therefore 0.86/0.309 = 2.78.¹ Thus the
peaked distribution in question could not have been drawn at
random from a normally distributed universe.
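
The two critical ratios can be reproduced as follows. This is a minimal
sketch using the large-sample standard errors √(6/N) and √(24/N) with
N = 252; the variable names are illustrative.

```python
import math

n = 252               # size of the distribution from Chap. IX
g1, g2 = 1.17, 0.86   # measures of skewness and kurtosis

se_g1 = math.sqrt(6 / n)    # standard error of g1, formula (134)
se_g2 = math.sqrt(24 / n)   # standard error of g2, formula (135)
cr_g1 = g1 / se_g1
cr_g2 = g2 / se_g2

print(round(se_g1, 3), round(cr_g1, 1), round(se_g2, 3), round(cr_g2, 2))
```

Both ratios exceed two standard errors, the skewness by a wide margin
and the kurtosis more narrowly.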

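¹ For a more exact interpretation of the critical ratio $g_2/\epsilon_{g_2}$, see L. H. C.
Tippett, The Methods of Statistics, 2nd ed., page 86.

4. The Significance of the Difference between Any Two
Statistics.—The variance of the differences, D, between N paired
values of two variables, $X_1$ and $X_2$, is, by the usual formula,

$$\sigma_D^2 = \frac{\Sigma (D - M_D)^2}{N},$$

where $M_D$ is the mean of the differences. Or, since

$$D = X_1 - X_2, \quad \text{and} \quad M_D = \frac{\Sigma D}{N} = \frac{\Sigma (X_1 - X_2)}{N} = \frac{\Sigma X_1}{N} - \frac{\Sigma X_2}{N},$$

$$\sigma_D^2 = \frac{\Sigma \left[ (X_1 - X_2) - \left( \frac{\Sigma X_1}{N} - \frac{\Sigma X_2}{N} \right) \right]^2}{N} = \frac{\Sigma \left[ (X_1 - M_{X_1}) - (X_2 - M_{X_2}) \right]^2}{N}.$$

Letting $X_1 - M_{X_1} = x_1$, and $X_2 - M_{X_2} = x_2$,

$$\sigma_D^2 = \frac{\Sigma (x_1 - x_2)^2}{N} = \frac{\Sigma (x_1^2 - 2 x_1 x_2 + x_2^2)}{N} = \frac{\Sigma x_1^2}{N} - \frac{2 \Sigma x_1 x_2}{N} + \frac{\Sigma x_2^2}{N} = \sigma_1^2 + \sigma_2^2 - 2 \sigma_1 \sigma_2 \frac{\Sigma x_1 x_2}{N \sigma_1 \sigma_2}.$$

By formula (81), $\Sigma x_1 x_2 / N \sigma_1 \sigma_2 = r_{12}$, so that

$$\sigma_D^2 = \sigma_1^2 + \sigma_2^2 - 2 r_{12} \sigma_1 \sigma_2.$$

If now we let $\sigma = \epsilon$, we have

$$\epsilon_D^2 = \epsilon_1^2 + \epsilon_2^2 - 2 r_{12} \epsilon_1 \epsilon_2, \tag{136}$$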

where $\epsilon_1$ is the standard error of the statistic in the first sample,
$\epsilon_2$ is the standard error of the corresponding statistic in the
second sample, and $r_{12}$ is the correlation coefficient between a
number of sample values of the two statistics.

Usually, correlation between two sample statistics is purposely
introduced by drawing one sample at random, and then matching
on some principle each of the items or values so drawn with an
item or value from another population. For example, the I.Q.'s
of a random sample of criminals may be matched or paired with
the I.Q.'s of their brothers.

If the statistics are the means or proportions from two samples
whose individual values are matched in some way, the simple
correlation coefficient, $r_{12}$, in the case of means, or $r_t$ (tetrachoric
correlation coefficient) in the case of proportions, may be used
to determine the amount of correlation between the paired
items of the two samples. Where correlation is believed to
exist between two samples, but it is not known what items are
paired or on what principle the correlation depends, it is often
difficult to find the value of $r_{12}$.

When there is no correlation between the two samples, i.e.,
when both of the samples are random or simple and so are
independent, $r_{12}$ = 0, and formula (136) reduces to

$$\epsilon_D^2 = \epsilon_1^2 + \epsilon_2^2. \tag{137}$$

Formulas (136) and (137) make no assumptions in addition
to those involved in finding $\epsilon_1$ and $\epsilon_2$.
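
The identity on which formula (136) rests can be verified on any set
of paired values. The sketch below uses six made-up pairs (purely
hypothetical data, not taken from the text) and compares the variance
of the differences computed directly with the value given by
σ₁² + σ₂² - 2r₁₂σ₁σ₂.

```python
import math

# Hypothetical paired values, chosen only to illustrate the identity.
x1 = [8, 3, 8, 2, 6, 5]
x2 = [5, 4, 7, 2, 5, 3]

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance with divisor N, as in the derivation above."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

n = len(x1)
d = [a - b for a, b in zip(x1, x2)]
s1, s2 = math.sqrt(variance(x1)), math.sqrt(variance(x2))
m1, m2 = mean(x1), mean(x2)
r12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n * s1 * s2)

direct = variance(d)                                   # variance of D itself
by_formula = s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2     # right side of the identity

print(round(direct, 6), round(by_formula, 6))
```

The two figures agree to the limit of machine precision, whatever
paired values are substituted.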

5. The Difference between Two Means.—A simpler formula
than (136) for testing the difference between the means of two
matched samples is

$$\epsilon_D = \frac{\sigma_d}{\sqrt{N}}, \tag{138}$$

where $\sigma_d$ is the standard deviation of the differences between
the paired values, and N is the number of pairs. The $\sigma_d$ is
estimated from the usual formula for the standard deviation.
Formula (138) assumes that the experimental sample (i.e., the
random sample that receives the "treatment")* is a simple
sample.

The scores of a simple sample of brothers and their sisters on a
personality test are shown in Table 71. Are the means of the
two series significantly different? Correlation is evidently
present between brother and sister, so it is necessary to calculate the
correlation between them or else to use formula (138). We shall
do both, for comparison. For Table 71 we have, by formula (74),

$$r = \frac{N \Sigma XY - \Sigma X \Sigma Y}{\sqrt{[N \Sigma X^2 - (\Sigma X)^2][N \Sigma Y^2 - (\Sigma Y)^2]}} = \frac{40(1{,}215) - 212(203)}{\sqrt{[40(1{,}320) - (212)^2][40(1{,}231) - (203)^2]}} = 0.7.$$

Also,

$$\sigma_d = \sqrt{\frac{\Sigma d^2}{N} - \left(\frac{\Sigma d}{N}\right)^2} = \sqrt{\frac{121}{40} - \left(\frac{9}{40}\right)^2}, \qquad \sigma_d = 1.72.$$

Now, the standard error squared of the mean of the X's is, by
formula (118) of Chap. XII,

$$\epsilon_{M_X}^2 = \frac{\sigma_X^2}{N}, \qquad \sigma_X^2 = \frac{1{,}320}{40} - \left(\frac{212}{40}\right)^2 = 4.91, \qquad \epsilon_{M_X}^2 = \frac{4.91}{40} = 0.12.$$

* See Chaps. III and IV.
TABLE 71.—PERSONALITY TEST SCORES OF 40 PAIRS OF BROTHERS AND
SISTERS. (HYPOTHETICAL DATA)

Brother   Sister
  (X)      (Y)       XY      X²      Y²      d      d²
    8        5       40      64      25      3      9
    3        4       12       9      16     -1      1
    8        7       56      64      49      1      1
    2        2        4       4       4      0      0
    2        3        6       4       9     -1      1
    4        7       28      16      49     -3      9
    6        5       30      36      25      1      1
    5        3       15      25       9      2      4
    7        9       63      49      81     -2      4
   10        9       90     100      81      1      1
   10        8       80     100      64      2      4
    1        3        3       1       9     -2      4
    9        7       63      81      49      2      4
    8        6       48      64      36      2      4
    7        5       35      49      25      2      4
    7        8       56      49      64     -1      1
    5        4       20      25      16      1      1
    5        6       30      25      36     -1      1
    5        5       25      25      25      0      0
    4        6       24      16      36     -2      4
    4        4       16      16      16      0      0
    4        3       12      16       9      1      1
    4        1        4      16       1      3      9
    3        3        9       9       9      0      0
    3        2        6       9       4      1      1
    3        5       15       9      25     -2      4
    3        4       12       9      16     -1      1
    5        7       35      25      49     -2      4
    8       10       80      64     100     -2      4
    6        5       30      36      25      1      1
    6        4       24      36      16      2      4
    7        7       49      49      49      0      0
    7        4       28      49      16      3      9
    5        8       40      25      64     -3      9
    4        1        4      16       1      3      9
    2        2        4       4       4      0      0
    6        5       30      36      25      1      1
    4        3       12      16       9      1      1
    7        6       42      49      36      1      1
    5        7       35      25      49     -2      4
  212      203    1,215   1,320   1,231      9    121
Similarly, for the Y's,

$$\epsilon_{M_Y}^2 = \frac{\sigma_Y^2}{N}, \qquad \sigma_Y^2 = \frac{1{,}231}{40} - \left(\frac{203}{40}\right)^2 = 5.02, \qquad \epsilon_{M_Y}^2 = \frac{5.02}{40} = 0.13.$$

Substituting in formula (136),

$$\epsilon_D^2 = 0.12 - 2(0.7)(0.35)(0.36) + 0.13, \qquad \epsilon_D^2 = 0.07, \text{ or } \epsilon_D = 0.27.$$

Now,

$$M_X = \frac{\Sigma X}{N} = \frac{212}{40} = 5.30, \qquad M_Y = \frac{\Sigma Y}{N} = \frac{203}{40} = 5.08,$$

$$\text{C.R.*} = \frac{5.30 - 5.08}{0.27} = 0.81.$$

The critical ratio is much less than two, so the difference between
the two means is not significant.

* In testing differences, the critical ratio (C.R.) is the ratio of the
difference to the standard error of the difference.

Substituting next in formula (138),

$$\epsilon_D = \frac{\sigma_d}{\sqrt{40}} = 0.27,$$

which quickly gives the same value obtained by the longer
method.

The meaning of this result is that there is no more difference
between the scores of brothers and sisters on a personality test
than might be attributed to random errors of sampling.

Suppose we had neglected the correlation between the data
of Table 71, and used formula (137) to test the significance of the
difference between the two means. How much would the result
have been changed? We have, from formula (137),

$$\epsilon_D^2 = 0.12 + 0.13 = 0.25$$

and

$$\text{C.R.} = \frac{5.30 - 5.08}{\sqrt{0.25}} = 0.44.$$

The correction for dependence thus almost doubled the critical
ratio, although in this instance it did not change the verdict
regarding the significance of the difference.
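
The whole comparison can be run from the paired scores. The sketch
below is illustrative: the 40 pairs are entered as read from Table 71
(they reproduce the column totals 212, 203, 1,215, 1,320, 1,231, 9,
and 121), and the variable names are not from the original.

```python
import math

# (brother, sister) scores, as read from Table 71
pairs = [(8, 5), (3, 4), (8, 7), (2, 2), (2, 3), (4, 7), (6, 5), (5, 3),
         (7, 9), (10, 9), (10, 8), (1, 3), (9, 7), (8, 6), (7, 5), (7, 8),
         (5, 4), (5, 6), (5, 5), (4, 6), (4, 4), (4, 3), (4, 1), (3, 3),
         (3, 2), (3, 5), (3, 4), (5, 7), (8, 10), (6, 5), (6, 4), (7, 7),
         (7, 4), (5, 8), (4, 1), (2, 2), (6, 5), (4, 3), (7, 6), (5, 7)]

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

n = len(pairs)
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
ds = [x - y for x, y in pairs]

# Short method, formula (138): standard deviation of the differences over root N.
eps_d_138 = math.sqrt(variance(ds)) / math.sqrt(n)

# Long method, formula (136): needs r and the two standard errors of the means.
cov = sum((x - mean(xs)) * (y - mean(ys)) for x, y in pairs) / n
r = cov / math.sqrt(variance(xs) * variance(ys))
eps_x2, eps_y2 = variance(xs) / n, variance(ys) / n
eps_d_136 = math.sqrt(eps_x2 + eps_y2 - 2 * r * math.sqrt(eps_x2 * eps_y2))

diff = mean(xs) - mean(ys)
cr_matched = diff / eps_d_138                       # correlation allowed for
cr_independent = diff / math.sqrt(eps_x2 + eps_y2)  # formula (137), correlation ignored

print(round(r, 2), round(eps_d_138, 2), round(cr_matched, 2), round(cr_independent, 2))
```

Formulas (136) and (138) give the same standard error, as they must.
Carried without intermediate rounding, the critical ratios are about
0.83 and 0.45 rather than 0.81 and 0.44, a difference that comes only
from the two-decimal rounding in the hand calculation.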

When the hypothesis to be tested is that two simple samples
were drawn from the same universe, the best estimate of any
parameter is found by pooling the two samples. For example,
if we are testing the difference between the means of two samples,
formula (137) becomes

$$\epsilon_D^2 = \sigma^2 \left( \frac{1}{N_1} + \frac{1}{N_2} \right), \tag{139}$$

where $N_1$ is the number of cases in the first sample, $N_2$ is the
number of cases in the second sample, and $\sigma^2$ is found from the
two samples combined by the equation¹

$$\sigma^2 = \frac{\sum\limits^{N_1} X_1^2 - N_1 M_{X_1}^2 + \sum\limits^{N_2} X_2^2 - N_2 M_{X_2}^2}{N_1 + N_2}, \tag{140}$$

where $X_1$ is any value of the variate in the first sample, $M_{X_1}$ is
the mean of the first sample, $X_2$ is any value of the variate in the
second sample, and $M_{X_2}$ is the mean of the second sample.

¹ This formula gives simply a weighted mean of the two variances, $\sigma_1^2$ and
$\sigma_2^2$, and should not be confused with formula (29) of Chap. VIII, which gives
the variance of combined distributions.

Table 72, below, gives the scores of 75 communities on a
community organization test, the sample of communities being
simple and independent of the sample of 100 communities in
Table 69 of Chap. XII. The mean score of Table 72 is 48.4,
and its standard deviation is 23. Let us test the hypothesis
that these two tables represent simple samples from the same
universe.

TABLE 72.—SCORES OF 75 COMMUNITIES ON A COMMUNITY ORGANIZATION
TEST

Score (X)     f     d′     fd′    fd′²
80-99         7      2      14     28
60-79        15      1      15     15
40-59        29      0       0      0
20-39        13     -1     -13     13
0-19         11     -2     -22     44
Total        75             -6    100

The samples being independent by definition, we
require formula (139). The variance of the two samples
combined is found by formula (140), expressed in frequency
form:

$$\sigma^2 = i^2 \, \frac{\Sigma f_1 d_1'^2 - \dfrac{(\Sigma f_1 d_1')^2}{N_1} + \Sigma f_2 d_2'^2 - \dfrac{(\Sigma f_2 d_2')^2}{N_2}}{N_1 + N_2} \tag{140a}$$

$$\sigma^2 = (20)^2 \, \frac{117 - \dfrac{(-7)^2}{100} + 100 - \dfrac{(-6)^2}{75}}{100 + 75}, \qquad \sigma^2 = 493.78.$$

Substituting this value in formula (139),

$$\epsilon_D^2 = 493.78 \left( \frac{1}{100} + \frac{1}{75} \right) = 11.52, \qquad \epsilon_D = 3.394.$$

We therefore have

$$\text{C.R.} = \frac{48.6 - 48.4}{3.394} = 0.0589.$$

Evidently, the data of the two tables might well represent
simple samples from the same universe, as far as their mean
values are concerned.
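
The pooled calculation can be sketched as below. Several inputs are
assumptions recovered from the worked figures rather than read from
clean tables: the class width of 20, the sums Σfd′ = -7 and
Σfd′² = 117 for Table 69, and the Table 69 mean of 48.6. The sums for
Table 72 (-6 and 100) and its mean of 48.4 are as given.

```python
import math

interval = 20                                 # assumed class width
n_69, sum_fd_69, sum_fd2_69 = 100, -7, 117    # Table 69 (sums are assumptions)
n_72, sum_fd_72, sum_fd2_72 = 75, -6, 100     # Table 72
mean_69, mean_72 = 48.6, 48.4                 # Table 69 mean is an assumption

# Formula (140) in frequency form: pooled sum of squares about each sample's own mean.
pooled_var = interval ** 2 * (
    (sum_fd2_69 - sum_fd_69 ** 2 / n_69) + (sum_fd2_72 - sum_fd_72 ** 2 / n_72)
) / (n_69 + n_72)

# Formula (139): standard error of the difference on the same-universe hypothesis.
eps_d = math.sqrt(pooled_var * (1 / n_69 + 1 / n_72))
cr = (mean_69 - mean_72) / eps_d

print(round(pooled_var, 2), round(eps_d, 3), round(cr, 4))
```

Under these assumptions the sketch returns the pooled variance 493.78,
the standard error 3.394, and the critical ratio 0.0589.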

If it is believed that two simple samples are from different
universes, and it is wanted to test whether the difference between
their means falls within the range of chance error so that it
might sometimes be obliterated by sampling error, formula (137)
takes the form

$$\epsilon_D^2 = \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}. \tag{141}$$

Applying this formula to Tables 69 and 72, we get

$$\epsilon_D^2 = \frac{466.56}{100} + \frac{530.77}{75} = 11.74, \qquad \epsilon_D = 3.43,$$

which is slightly larger than the standard error obtained on the
assumption that the samples are from the same universe. Since
the critical ratio is only

$$\text{C.R.} = \frac{48.6 - 48.4}{3.43} = .0583,$$

we interpret it to mean that if there is a real difference between
the universes from which the two samples came, it may easily be
reduced to zero or nearly zero in random samples.
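
A brief sketch of formula (141) follows. It takes the variance of
Table 69 as 466.56 (that is, 21.6 squared), an assumption made here
because it is the value that reproduces the sum 11.74 and the
standard error 3.43; the variance 530.77 for Table 72 and the two
means are treated the same way as in the text.

```python
import math

var_69, n_69, mean_69 = 466.56, 100, 48.6   # Table 69 (variance assumed, see above)
var_72, n_72, mean_72 = 530.77, 75, 48.4    # Table 72

# Formula (141): each sample keeps its own variance.
eps_d_sq = var_69 / n_69 + var_72 / n_72
eps_d = math.sqrt(eps_d_sq)
cr = (mean_69 - mean_72) / eps_d

print(round(eps_d_sq, 2), round(eps_d, 2), round(cr, 4))
```

The standard error is a little larger than the 3.394 found with the
pooled variance, so the critical ratio is a little smaller.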

Sometimes two simple samples are taken from the same
universe, and the mean of sample 1 is tested against the mean
of the two samples combined. Correlation is thereby introduced,
and the appropriate formula is then

$$\epsilon_D^2 = \frac{\sigma^2 N_2}{N_1 (N_1 + N_2)}, \tag{142}$$

where $\sigma^2$ is found from the two samples combined, using formula
(140) or formula (140a). This formula leads to the same critical
ratio as formula (139). To show this, let us test the mean score
(48.4) of the 75 communities in Table 72 against the mean
score (48.51) of Table 72 and Table 69 of Chap. XII combined,
on the theory that the two samples together give a better picture
of the universe of communities from which the samples were
drawn than does either one alone. Substituting in formula (142)
the values previously found,

$$\epsilon_D^2 = \frac{493.78(100)}{75(75 + 100)}, \qquad \epsilon_D^2 = 3.76, \qquad \epsilon_D = 1.9396,$$

$$\text{C.R.} = \frac{48.51 - 48.4}{1.9396} = 0.0589,$$

which is identical with the result previously obtained.
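
That formulas (139) and (142) lead to the same critical ratio can be
confirmed directly. The sketch below is illustrative; it takes the
pooled variance 493.78 and the means 48.4 (Table 72, 75 cases) and
48.6 (Table 69, 100 cases, an assumed value consistent with the
combined mean of 48.51), and carries the combined mean without
rounding.

```python
import math

pooled_var = 493.78
n1, m1 = 75, 48.4     # sample 1: Table 72
n2, m2 = 100, 48.6    # sample 2: Table 69 (mean assumed)

m_combined = (n1 * m1 + n2 * m2) / (n1 + n2)

# Formula (139): sample 1 against sample 2.
eps_139 = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
cr_139 = (m2 - m1) / eps_139

# Formula (142): sample 1 against the two samples combined.
eps_142 = math.sqrt(pooled_var * n2 / (n1 * (n1 + n2)))
cr_142 = (m_combined - m1) / eps_142

print(round(m_combined, 2), round(eps_142, 4), round(cr_139, 4), round(cr_142, 4))
```

The agreement is exact only when the combined mean is carried to
several decimals (48.514...); with the rounded 48.51 the quotient
would come out slightly lower.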

6. The Difference between Two Proportions.—Although we
have so far dealt only with the differences between means of
samples, the same types of formula hold for other statistics.
For example, if we are testing the difference between two
proportions, the formulas corresponding to formulas (139), (140),
(141), and (142) are, in order:

Two simple samples from the same universe,

$$\epsilon_D^2 = \bar{p}\bar{q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right), \tag{143}$$

where

$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}. \tag{144}$$

Two simple samples from different universes,

$$\epsilon_D^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}. \tag{145}$$

Two simple samples from the same universe, the proportion of
successes in the first tested against the mean proportion of
successes in the two combined,

$$\epsilon_D^2 = \frac{\bar{p}\bar{q}\, n_2}{(n_1 + n_2)\, n_1}. \tag{146}$$
Can samples 1 and 2 of Table 73, below, be simple samples
from the same universe, in the proportion of families having
seven or more members? Applying formula (143), we need the
value of $\bar{p}$ from formula (144):

$$\bar{p} = \frac{(132)(.1061) + (134)(.0672)}{132 + 134} = .0865, \qquad \bar{q} = 1.0000 - .0865 = .9135,$$

whence

$$\epsilon_D^2 = (.0865)(.9135) \left( \frac{1}{132} + \frac{1}{134} \right) = .001188, \qquad \epsilon_D = 0.0345,$$

$$\text{C.R.} = \frac{.1061 - .0672}{.0345} = 1.13.$$

From Appendix Table 1, we find a probability of about 26 in 100
that a difference greater than that observed between the
proportions in the two samples might occur by sampling error, under
the conditions assumed. If now we use the alternative formula
(146), we get

$$\epsilon_D^2 = \frac{(.0865)(.9135)(134)}{(132 + 134)\,132} = .00030, \qquad \epsilon_D = 0.0174,$$

$$\text{C.R.} = \frac{.1061 - .0865}{.0174} = 1.13.$$

Notice again that this is the same critical ratio as that obtained
just above by the use of formula (143).

TABLE 73.—TWO SAMPLES OF FAMILIES, CLASSIFIED BY NUMBER OF
MEMBERS
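
The two routes to the critical ratio can be compared in a few lines.
The counts used below are assumptions, since the body of Table 73 is
not reproduced here: 14 of the 132 families in sample 1 and 9 of the
134 in sample 2 are taken to have seven or more members, which is the
split consistent with the pooled proportion .0865 and the critical
ratio 1.13.

```python
import math

n1, k1 = 132, 14   # sample 1: families, and those with 7+ members (assumed count)
n2, k2 = 134, 9    # sample 2: families, and those with 7+ members (assumed count)

p1, p2 = k1 / n1, k2 / n2
p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)   # formula (144)
q_bar = 1 - p_bar

# Formula (143): sample 1 against sample 2.
eps_143 = math.sqrt(p_bar * q_bar * (1 / n1 + 1 / n2))
cr_143 = (p1 - p2) / eps_143

# Formula (146): sample 1 against the two samples combined.
eps_146 = math.sqrt(p_bar * q_bar * n2 / ((n1 + n2) * n1))
cr_146 = (p1 - p_bar) / eps_146

print(round(p_bar, 4), round(eps_143, 4), round(eps_146, 4), round(cr_143, 2))
```

As with the means, the two critical ratios are algebraically
identical, so either formula may be used.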

7. The Difference between Two Correlation Coefficients.—To
test the significance of the difference between two correlation
coefficients from simple samples,¹ the variates being normally
distributed and independent, it is necessary to convert $r_1$ and $r_2$
to $z_1$ and $z_2$, respectively. This is readily done by means of
Appendix Table 5. The standard error of z is then found from
the formula

$$\epsilon_z = \frac{1}{\sqrt{N - 3}}. \tag{147}$$

Suppose for the correlation between the linguistic ability and
leadership scores of a group of children, we find $r_1$ = .50 from
sample 1, and $r_2$ = .60 from sample 2, where $N_1 = N_2 = 50$.
Is the difference between the two r's significant, or is it merely
an accident of sampling? From Appendix Table 5, we find for
$r_1$ = .50, $z_1$ = .549, and for $r_2$ = .60, $z_2$ = .693. By formula
(147) we calculate the standard error of z,

$$\epsilon_z = \frac{1}{\sqrt{50 - 3}} = 0.146.$$

Hence the standard error of the difference $z_2 - z_1$ is, by formula
(137),

$$\epsilon_D^2 = (.146)^2 + (.146)^2 = 0.0426.$$

So that

$$\text{C.R.} = \frac{.693 - .549}{\sqrt{0.0426}} = 0.70.$$

Since this critical ratio is well under two standard errors, we
infer that the difference between the two r's is not significant.

¹ Of any size.
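
The conversion and the test can be sketched as follows. The sketch
assumes that Appendix Table 5 tabulates the transformation
z = arctanh r, an assumption that the quoted values .549 and .693
bear out; `math.atanh` stands in for the table.

```python
import math

r1, n1 = 0.50, 50
r2, n2 = 0.60, 50

z1, z2 = math.atanh(r1), math.atanh(r2)   # stands in for Appendix Table 5
se_z1 = 1 / math.sqrt(n1 - 3)             # formula (147)
se_z2 = 1 / math.sqrt(n2 - 3)
eps_d = math.sqrt(se_z1 ** 2 + se_z2 ** 2)   # formula (137), independent samples
cr = (z2 - z1) / eps_d

print(round(z1, 3), round(z2, 3), round(se_z1, 3), round(cr, 2))
```

Because each sample has its own standard error, the same lines serve
when the two samples differ in size.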

8. Testing the Significance of a Sum.—The basic formulas 
(136) and (137) are the same for the sum as for the difference 
between two statistics, except that for the sum all the signs 
of formula (136) are positive. 

9. Testing the Hypothesis of Simple Sampling.—Suppose we
ask if Table 72 above can be a simple sample from a universe
of communities in which the mean score is 40. We have
$\epsilon_M = \sigma^*/\sqrt{N} = 23/\sqrt{75} = 2.66$, so that

$$\text{C.R.} = (48.4 - 40.0)/2.66 = 3.16.$$

Since the critical ratio is greater than two standard errors, it is
not likely that the sample is a simple sample from a universe of
communities whose mean score is 40. There are several possible
explanations: (1) Table 72 may be a simple sample with an
extreme mean that might rarely be drawn by chance from the
given universe; (2) it may be a sample from the given universe,
but not taken as a simple sample should have been; or (3) it
may not be a sample from the given universe at all. There is no
way to determine which of these possibilities is correct, unless
it can be learned how the sample was actually taken.

The purpose of testing the difference between the two
means in Sec. 5, above, might have been to discover whether
or not they could occur in two simple samples from the same
universe. The very low critical ratio of 0.0589 suggests an
affirmative answer, but it cannot completely establish the fact.
For example, the low critical ratio might be due to the small
size of the samples or to the presence of correlation, or it might
be an accident not connected with random sampling.

The same test can, of course, be employed to determine
whether a sample might be random or Poisson, by merely using
the standard error formula appropriate in each case.
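
The test against a hypothetical universe mean can be sketched as
below, using the mean 48.4, standard deviation 23, and N = 75 of
Table 72. The two-tailed probability is taken from the error function
and is an addition for illustration; the text itself stops at the
critical ratio.

```python
import math

sample_mean, sd, n = 48.4, 23, 75
universe_mean = 40.0                 # hypothetical parameter being tested

se_mean = sd / math.sqrt(n)          # standard error of the mean
cr = (sample_mean - universe_mean) / se_mean
p_two_tailed = math.erfc(cr / math.sqrt(2))

print(round(se_mean, 2), round(cr, 2), round(p_two_tailed, 4))
```

A probability this small says only that explanation (1) is unlikely;
it does not choose between explanations (2) and (3).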

* Found from formula (118), Chap. XII.

10. The Significance of the Difference between Two or More
Frequency Distributions.—A more complete test of the hypothesis
that two or more samples are simple samples from the same
universe is possible by the Chi-square method, which goes
beyond the comparison of single statistics (e.g., means) to the
comparison of whole distributions. In Table 73 we have two
samples of families distributed according to number of members.
We can use either formula (148) or formula (149) to find Chi
square ($\chi^2$). Formula (148) is applicable only to the case of
two samples, when the total $\chi^2$ for each row is not wanted, but is
quicker than formula (149). We shall apply it here.


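$$\chi^2 = \frac{1}{pq}\left[\sum\left({}_r f_1 \cdot {}_r p_1\right) - n_1 p\right], \qquad (148)$$

$$\chi^2 = \sum\left[\frac{(f_o - f_t)^2}{f_t}\right], \qquad (149)$$

where

$$p = \frac{n_1}{N}, \qquad q = 1 - p, \qquad {}_r p_1 = \frac{{}_r f_1}{{}_r n},$$

${}_r f_1$ is the frequency in row r of col. 1, $n_c$ is the total of any column
c, $n_1$ is the total of col. 1, ${}_r n$ is the total of any row r, $f_o$ is any
observed frequency, $f_t$ is any theoretical or expected frequency,
and N is the total of the whole table. For Table 73 we have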


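$$p = \tfrac{132}{266} = 0.496, \qquad q = 1 - 0.496 = 0.504,$$

$${}_1p_1 = \tfrac{56}{122} = 0.459, \qquad {}_2p_1 = \tfrac{40}{82} = 0.488, \qquad {}_3p_1 = \tfrac{22}{39} = 0.564, \qquad {}_4p_1 = \tfrac{14}{23} = 0.609.$$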


To get ${}_4p_1$, the frequencies of the last five rows were combined, in
accordance with the rule that no cell should contain less than
five expected frequencies. Substituting in formula (148),

$$\chi^2 = \frac{1}{(.496)(.504)}\,[56(.459) + 40(.488) + 22(.564) + 14(.609) - 132(.496)],$$

$$\chi^2 = \frac{1}{.25}\,(.69),$$

$$\chi^2 = 2.76.$$
Entering a table of Chi square (Appendix Table 2) with

$$r - 1 = 4 - 1 = 3 \text{ degrees of freedom}$$

(i.e., one less than the number of rows, counting the five combined
rows as one), we find a probability P between .30 and .50
that the differences between the two samples might be due to 
errors of simple sampling from the same universe. The test 
therefore fails to show that the two sample distributions of 
families differ significantly in number of members. 
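As a rough check on the two formulas, the sketch below computes Chi square both ways for a two-sample table. The cell counts are an assumption: they are inferred from the proportions and the column total quoted in the substitution above and should be checked against Table 73. Variable names are illustrative.

```python
# Observed counts by size class, with the last five classes combined.
# Assumed values: inferred from the proportions quoted in the text.
col1 = [56, 40, 22, 14]   # sample 1, total 132
col2 = [66, 42, 17, 9]    # sample 2, total 134

n1, n2 = sum(col1), sum(col2)
N = n1 + n2
p = n1 / N
q = 1 - p

# Formula (148): uses the proportion of each row that falls in column 1.
row_totals = [a + b for a, b in zip(col1, col2)]
row_p = [a / t for a, t in zip(col1, row_totals)]
chi2_148 = (sum(f * rp for f, rp in zip(col1, row_p)) - n1 * p) / (p * q)

# Formula (149): sum of (observed - expected)^2 / expected over all cells.
chi2_149 = 0.0
for a, b, t in zip(col1, col2, row_totals):
    for observed, col_total in ((a, n1), (b, n2)):
        expected = t * col_total / N
        chi2_149 += (observed - expected) ** 2 / expected

degrees_of_freedom = len(col1) - 1
print(round(chi2_148, 2), round(chi2_149, 2), degrees_of_freedom)
```

The two formulas agree exactly when no rounding is done. With full precision the statistic comes out near 2.58 rather than 2.76, because formula (148) subtracts two nearly equal quantities and is sensitive to the three-decimal rounding of the proportions; the probability still lies between .30 and .50.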

The Chi-square test may also be used to investigate whether 
a sample is a simple sample from a known universe, if the distri- 
bution of the universe is known. The universe distribution, 
with N equated to that of the sample, simply takes the place 
of one of the samples in Table 73. 

To test whether more than two samples are from the same 
universe, it is necessary to find Chi square by formula (149). 
Its application to three samples in Table 74 is shown below. 


TABLE 74.—THREE SAMPLES OF FAMILIES, CLASSIFIED BY NUMBER OF
MEMBERS

Members          Sample 1                 Sample 2                 Sample 3
in          f_o    f_t   (f_o-f_t)²   f_o    f_t   (f_o-f_t)²   f_o    f_t   (f_o-f_t)²   Total
family                    /f_t                      /f_t                      /f_t
1-2          56   56.34    .00205      66   57.20   1.35385      53   61.46   1.16452      175
3-4          40   40.57    .00801      42   41.18    .01633      44   44.25    .00141      126
5-6          22   21.57    .00857      17   21.90   1.09635      28   23.53    .84917       67
7 or more    14   13.52    .01704       9   13.73   1.62949      19   14.75   1.22457       42
Total       132  132.00    .03567     134  134.00   4.09602     144  143.99   3.23967      410

(Last row: the five classes of 7 or more members combined; their separate
row totals are 20, 8, 9, 2, and 3.)


The expected frequency, $f_t$, in any cell is found, as explained in
Chap. IX, by dividing the table total into the product of the
row and column totals. For example, the expected frequency
in the class interval 3-4, sample 2, is 126(134)/410 = 41.18;
in the class interval 5-6, sample 3, it is 67(144)/410 = 23.53;
and so on. The last five rows are combined, because four of
them have fewer than five expected frequencies. After combining,
the expected frequency for sample 1 in the combined row is
(132)(42)/410 = 13.52. By formula (149),

$$\chi^2 = 7.37136.$$
The degrees of freedom are

$$(c - 1)(r - 1) = (3 - 1)(4 - 1) = 6,$$

where c is the number of columns, and r is the number of rows,
counting all combined rows as one. Entering a table of $\chi^2$ with
six degrees of freedom, we see that $\chi^2 = 12.59$ for P = .05.
That is, a value of $\chi^2$ as large as 12.59 may be expected by chance
five times in 100. Smaller values of $\chi^2$ will, of course, occur
more often by chance. Since our value of $\chi^2$ is only 7.37136, it
cannot be regarded as significant. The test therefore furnishes
no evidence that our three samples are not simple samples from
the same universe.
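The calculation by formula (149) can be sketched as follows. The observed frequencies are an assumption: they are the counts implied by the expected frequencies and Chi-square contributions of Table 74, with the last five classes combined, and the names are illustrative.

```python
# Observed frequencies by size class (rows) and sample (columns),
# with the last five classes combined into one row.
# Assumed values: recovered from the expected frequencies and
# chi-square contributions shown in Table 74.
observed = [
    [56, 66, 53],   # 1-2 members
    [40, 42, 44],   # 3-4 members
    [22, 17, 28],   # 5-6 members
    [14, 9, 19],    # 7 or more members (combined rows)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, f_o in enumerate(row):
        # Expected frequency: row total times column total over table total.
        f_t = row_totals[i] * col_totals[j] / grand_total
        chi_square += (f_o - f_t) ** 2 / f_t

degrees_of_freedom = (len(col_totals) - 1) * (len(row_totals) - 1)
critical_value_05 = 12.59   # tabled value for 6 d.f. at P = .05, as quoted in the text

print(round(chi_square, 2), degrees_of_freedom, chi_square > critical_value_05)
```

Because the computed value falls well below the tabled 12.59, the sketch reaches the same conclusion as the text: no evidence against a common universe.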

11. The Significance of the Difference between Statistics 
from More than Two Samples.—In testing the significance of 
the differences between statistics (e.g., means) from three or more 
samples, the probability of finding a significant difference by 
chance is greater than in the case of only one difference, just as 
the probability of getting an ace at cards is greater when we draw 
twice from the deck than when we draw only once. 

A formula that takes this into account is the following: 


$$\text{C.R.} = \frac{\sqrt{n - 1}}{n}\sum\frac{d_i}{\epsilon_i}, \qquad (150)$$

where $d_i$ is the difference between any two independent statistics,
$\epsilon_i$ is its standard error, and n is the number of differences. In


TABLE 75.—SIX SAMPLES OF 50 JUVENILE DELINQUENTS EACH, AND SIX
CONTROL SAMPLES, SHOWING PERCENTAGES NEUROTIC

Samples     Percentage delinquents neurotic     Percentage nondelinquents neurotic
Table 75 we have six simple samples of juvenile delinquents and
six simple samples of nondelinquents, each containing 50 boys. 
Using formula (143) to find the standard error of the six differ- 
ences, we have, for the first pair of samples in the table, 


The standard errors of the other differences are found similarly, 
and entered in Table 75. Substituting from the last column 
of the table in formula (150), 


$$\text{C.R.} = \frac{\sqrt{6 - 1}}{6}\,(-0.68) = -0.25.$$


From a table of normal areas (Appendix Table 1), it appears 
that a positive or negative critical ratio greater than this might 
occur by chance over 80 times in 100 trials, so that there is no 
evidence that the samples of delinquents differ significantly 
from the samples of nondelinquents in respect to the percentages 
neurotic.

Formula (150) is applicable to any set of independent critical 
ratios, including those from random or Poisson samples, if 
random or Poisson standard error formulas are used to find the 
values of є. 
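A small sketch of how several independent critical ratios might be combined is given below. It assumes the form of formula (150) suggested by the worked example, C.R. = sqrt(n - 1)/n times the sum of the individual ratios. The six individual ratios are hypothetical; only their sum, -0.68, comes from the text.

```python
import math

def combined_critical_ratio(ratios):
    """Combine n independent critical ratios d_i / e_i.

    Assumed form: C.R. = sqrt(n - 1) / n * sum(d_i / e_i).
    """
    n = len(ratios)
    return math.sqrt(n - 1) / n * sum(ratios)

# Hypothetical critical ratios for six pairs of samples. Only their sum
# (-0.68) is taken from the text; the individual values are made up.
ratios = [0.85, -1.20, 0.40, -0.95, 0.62, -0.40]

cr = combined_critical_ratio(ratios)

# Two-sided probability of a ratio at least this large (normal curve).
p_two_sided = math.erfc(abs(cr) / math.sqrt(2))

print(round(cr, 2), round(p_two_sided, 2))
```

A combined ratio this close to zero corresponds to a probability of roughly four in five, in line with the reading from the table of normal areas.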


Exercises 


1. Correlate the birth rates of Table 70 of Chap. XII with the fertility
ratios of the same sample counties in the table of Exercise 11 in
Chap. XII, and test whether the correlation coefficient is significantly
greater than zero.

2. In the table of Exercise 3, below, test the hypothesis that rural
farm and rural nonfarm families (1) are from the same universe in
respect to mean size of family; (2) are from different universes, but their
means might sometimes be approximately equal as a result of sampling
error.

3. In the table below test the assumption that urban families might
be a random sample from a universe which is best represented by the
three samples combined. Use the mean as the criterion.


SAMPLE OF 2,992 FAMILIES BY SIZE, FOR URBAN AND RURAL AREAS, UNITED
STATES, 1930

Size of           Urban        Rural farm    Rural nonfarm    Total
family            families     families      families
1                    140           34             62            236
2                    436          121            141            698
3                    384          119            120            623
4                    315          110             99            524
5                    202           88             68            358
6                    118           66             43            227
7                     66           47             27            140
8                     37           32             16             85
9                     20           20              9             49
10                    10           12              5             27
11                     5            6              2             13
12 or more*            4            6              2             12

Sample total       1,737          661            594          2,992

Universe total  17,400,000    6,600,000      5,900,000     29,900,000

* Count as 13.


4. Do the matched delinquents and nondelinquents in the sample 
below differ significantly in mean I.Q.? 
INTELLIGENCE QUOTIENTS OF A RANDOM SAMPLE OF 25 MALE JUVENILE
DELINQUENTS AND 25 MALE NONDELINQUENTS, MATCHED BY AGE,
FAMILY INCOME, AND PLACE OF RESIDENCE

                  Intelligence quotient
Pair number    Delinquent    Nondelinquent
 1                103             99
 2                 80             92
 3                114            106
 4                100            104
 5                 91             88
 6                 73             80
 7                105            109
 8                 98             94
 9                 86             90
10                101             97
11                 92             89
12                 86             91
13                 93             90
14                 90             97
15                 79             84
16                108             96
17                 82             91
18                 95             86
19                 74             83
20                102             97
21                105             99
22                 97            103
23                 88             91
24                 94             84
25                 99            106
Total           2,335          2,346


5. Combining rural farm and rural nonfarm families in the table of
Exercise 3, above, is there a significant difference between the mean
size of family in urban and rural areas according to this sample taken
proportionally at random from the two types of areas?

6. Apply Chi-square to the table of Exercise 3 above to test (1) the
hypothesis that rural farm and rural nonfarm families are simple samples
from the same universe; (2) the hypothesis that urban, rural farm, and
rural nonfarm families are simple samples from the same universe.

7. In the table of Exercise 3 above, test the hypothesis that the means
of the three simple samples are from the same universe.

8. Test the hypothesis that families of odd sizes and families of even
sizes in the table of Exercise 3 above are Poisson samples from the
same stratified universe.

9. Test the hypothesis that the value of a selected statistic from each
of the samples drawn in Exercise 4 of Chap. XII does not differ
significantly from the known value of the corresponding parameter in the
universe.

10. Test the hypothesis that the value of a selected statistic from the
sample [reference illegible] of Chap. XII does not differ significantly
from the known value of the corresponding parameter in the universe.


References 
Same as for Chap. XII.


CHAPTER XIV 
TIME SERIES ANALYSIS 


Values of a variable (e.g., infant death rates) given at successive
intervals of time (e.g., yearly) form a time series. Such series
are especially important in economics and are also necessary in
the study of vital statistics,¹ of trends in public expenditures for
relief, and many other topics. This chapter describes methods
for their analysis.

As an illustrative problem, let us inquire what the state of 
Wisconsin has accomplished in reducing infant mortality. 
Figures giving deaths per 1,000 live births from 1908 through 
1935 are shown as a time series in Table 76. 


TABLE 76.—WISCONSIN INFANT MORTALITY RATES, 1908-1935*

          Infant deaths                 Infant deaths
Year      per 1,000 live      Year      per 1,000 live
          births                        births
1908          107             1922           70
1909          120             1923           70
1910          109             1924           64
1911          103             1925           67
1912           95             1926           67
1913           97             1927           60
1914           83             1928           61
1915           78             1929           60
1916           86             1930           56
1917           78             1931           53
1918           79             1932           50
1919           79             1933           48
1920           77             1934           49
1921           72             1935           46


* Report of the Wisconsin Bureau of Vital Statistics, 1934-1935, p. 284. 
The first step that is usually taken in time series analysis is to
plot the data. This is done for Table 76 in the lower graph of
Fig. 55. Examination of this figure shows a striking decline in
infant mortality in Wisconsin over a 28-year period.

¹ Birth rates, death rates, marriage rates, and so on.


Fig. 55.—Infant mortality rates in Wisconsin and in the original registration
area of the United States, 1908-1935. (Note: Dots indicate moving averages.)
1. The Secular Trend: A Straight Line.—Suppose, further,
that we want to compare the infant mortality record in Wisconsin
with that of other states in the United States. Data for the
original birth registration area of 10 states and the District of
Columbia¹ are available for the period 1915 through 1933, and
are entered in Table 77. They are plotted as a dotted line in
Fig. 55. It is seen, from this figure, that infant mortality has
been less in Wisconsin than in the original registration area
throughout the entire period of comparison. But has the rate
of decline since 1915 been greater in Wisconsin or in the
original registration area?

¹ Connecticut, Maine, Massachusetts, Michigan, Minnesota, New Hampshire, New
York, Pennsylvania, Rhode Island, Vermont, and the District of Columbia.


TABLE 77.—INFANT MORTALITY IN THE ORIGINAL BIRTH REGISTRATION
AREA OF THE UNITED STATES, 1915-1933*

          Infant deaths                 Infant deaths
Year      per 1,000 live      Year      per 1,000 live
          births                        births
1915          100             1925           74
1916          100             1926           75
1917           96             1927           64
1918          106             1928           67
1919           89             1929           65
1920           90             1930           62
1921           79             1931           60
1922           79             1932           55
1923           79             1933           53
1924           72

* From Birth, Stillbirth, and Infant Mortality Statistics, 1933, p. 7, U. S. Bureau of the
Census.


The answer is to determine which of the two series has the steeper 
slope. Inspection shows that both graphs are irregular and 
saw-toothed in shape, so that the slope sometimes of one and 
sometimes of the other is the steeper. What we must do is to 
remove the irregularities in the two series, i.e., reduce them to
smooth curves. To do this is to find the secular trend, meaning 
the general direction of the series over a considerable period of 
time, freed from confusing oscillations. To answer our par- 
ticular question in the present case, it seems appropriate to fit 
straight lines to the data, since more complex smooth curves do 
not describe the declining death rates any better. 

We have already learned to fit a straight line by the device 
of least squares, in determining the regression equation in linear 
correlation. The normal equations for finding the values of a 
and b in the line of best fit are 


$$b = \frac{\sum xY}{\sum x^2}, \qquad (151)$$

$$a = M_y, \qquad (152)$$

where x is a deviate from the midyear of an odd¹ number of years,
Y is the infant death rate, and the origin is at the midyear or
mean of the X's. The values of a and b so found are substituted
in the equation of the straight line,

$$Y_c = a + bx. \qquad (153)$$

We set up Table 78 to obtain these values for the Wisconsin
series, and Table 79 for the registration area series.


TABLE 78.—FITTING A STRAIGHT LINE TO THE WISCONSIN DATA OF TABLE 76

Year      Year      Infant death rate       xY        x²
          (x)            (Y)
1915      -9              78              -702        81
1916      -8              86              -688        64
1917      -7              78              -546        49
1918      -6              79              -474        36
1919      -5              79              -395        25
1920      -4              77              -308        16
1921      -3              72              -216         9
1922      -2              70              -140         4
1923      -1              70               -70         1
1924       0              64                 0         0
1925       1              67                67         1
1926       2              67               134         4
1927       3              60               180         9
1928       4              61               244        16
1929       5              60               300        25
1930       6              56               336        36
1931       7              53               371        49
1932       8              50               400        64
1933       9              48               432        81
                     ΣY = 1,275      ΣxY = -1,075   Σx² = 570
                     M_y = 67.1

From Table 78,

$$b_1 = \frac{-1{,}075}{570} = -1.886,$$

and

$$a_1 = 67.1,$$

so that, approximately,

$$Y_1 = 67 - 1.89x_1 \qquad (153a)$$

is the equation of the straight-line secular trend fitted to infant
mortality rates in Wisconsin with origin at 1924.

¹ If the series includes an even number of years, one of them may be
dropped to give an odd number, so that the convenient short formulas (151)
and (152) may be used, instead of the more laborious normal equations for a
straight line given in Chap. X.

Similarly, from Table 79, we find the equation of the straight- 
line trend through infant mortality rates in the original registra- 
tion area of the United States to be approximately 


$$Y_2 = 77 - 2.75x_2. \qquad (153b)$$


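TABLE 79.—FITTING A STRAIGHT LINE TO THE ORIGINAL REGISTRATION
AREA DATA OF TABLE 77

Year      Year      Infant death rate       xY        x²
          (x)            (Y)
1915      -9             100              -900        81
1916      -8             100              -800        64
1917      -7              96              -672        49
1918      -6             106              -636        36
1919      -5              89              -445        25
1920      -4              90              -360        16
1921      -3              79              -237         9
1922      -2              79              -158         4
1923      -1              79               -79         1
1924       0              72                 0         0
1925       1              74                74         1
1926       2              75               150         4
1927       3              64               192         9
1928       4              67               268        16
1929       5              65               325        25
1930       6              62               372        36
1931       7              60               420        49
1932       8              55               440        64
1933       9              53               477        81
                     ΣY = 1,465      ΣxY = -1,569   Σx² = 570
                     M_y = 77.1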


We are now in a position to answer the question, Does the
trend line for Wisconsin or that for the original registration area
have the steeper slope? We see that the slope of the Wisconsin
line is $b_1 = -1.89$, whereas the slope of the original registration
area line is $b_2 = -2.75$. The negative signs mean that as x
increases, i.e., as time passes, Y, the infant death rate, decreases.
Evidently, the trend of infant mortality has been decreasing
2.75/1.89 = 1.5 times as fast in the original registration area as in
Wisconsin. The two lines of trend are plotted in Fig. 55 by
substituting appropriate values of x in equations (153a) and
(153b). For example, the ordinate of the line through the
Wisconsin data is, if $x_1 = -9$, $Y_1 = 67 - 1.89(-9) = 84$; and
if $x_1 = 9$, $Y_1 = 67 - 1.89(9) = 50$; so that the line is drawn
through the points (-9, 84) and (9, 50).

In terms of percentages, the infant mortality rate declined an 
average of 2.13 per cent per year in Wisconsin, as compared with 
2.56 per cent in the total registration area. 
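The short least-squares formulas used in this section can be sketched in Python. The function name `fit_trend` is illustrative, and the sketch assumes an odd number of equally spaced years so that the deviates x sum to zero; the rates are those of Tables 76 and 77 for 1915 through 1933.

```python
def fit_trend(rates):
    """Fit Y = a + b*x by least squares, with x measured from the middle year.

    Assumes an odd number of equally spaced years, so that sum(x) = 0 and
    the short formulas a = mean(Y), b = sum(x*Y) / sum(x^2) apply.
    """
    n = len(rates)
    if n % 2 == 0:
        raise ValueError("an odd number of years is required")
    half = n // 2
    xs = range(-half, half + 1)
    a = sum(rates) / n
    b = sum(x * y for x, y in zip(xs, rates)) / sum(x * x for x in xs)
    return a, b

# Infant deaths per 1,000 live births, 1915-1933 (Tables 76 and 77).
wisconsin = [78, 86, 78, 79, 79, 77, 72, 70, 70, 64,
             67, 67, 60, 61, 60, 56, 53, 50, 48]
registration_area = [100, 100, 96, 106, 89, 90, 79, 79, 79, 72,
                     74, 75, 64, 67, 65, 62, 60, 55, 53]

a1, b1 = fit_trend(wisconsin)
a2, b2 = fit_trend(registration_area)

print(round(a1, 1), round(b1, 2))   # Wisconsin intercept and slope
print(round(a2, 1), round(b2, 2))   # registration area intercept and slope
print(round(b2 / b1, 1))            # ratio of the two slopes
```

Measuring x from the middle year is what makes the short formulas work: the cross term in the normal equations vanishes, so the intercept is simply the mean rate.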

2. The Secular Trend: A Moving Average.—It is an important 
principle that any line or curve used to represent the secular 
trend of a series should be rather simple in form—a straight 
line if that is at all reasonable, otherwise seldom anything more 
complex than a second degree parabola (Y = a + bX + cX?). 
The reasons are that a trend line that follows the original data 
too closely includes cyclical variations from which the secular 
trend should be freed, and it also fails to fulfill the primary 
purpose of a trend line, which is to show clearly the general
direction, up or down, in which the series is moving.

Of course, a straight line may be a very poor fit for some
series, so that if we want to generalize the trend without doing
too much violence to the data we may need to fit another kind of
curve, say, a parabola. Although the formulas differ, the general
principles are the same.

A second method of determining secular trend, which usually
allows the trend line to follow the original data more closely
than a straight line does, should be explained. This is the
method of the moving average, which is shown in Table 80.
It is again preferable to average an odd number of years, because
the results can then be more conveniently centered at a given
year in the series. If cycles appear in the original series, the
length of the moving average should be equal to the average
period of a cycle from peak to peak, or some multiple thereof, if
the purpose is to represent the secular trend. But if the moving
average is used only to smooth out random fluctuations, its 
length should be less than that of an average cycle period. 
The shorter the period of the moving average, the more flexible 
is the resulting curve. Inspection of Fig. 55 suggests the pres- 
ence of possible cycles of about seven years in length in both 
series. Accordingly, moving averages of seven years are shown 
in Table 80. 


TABLE 80.—SEVEN-YEAR MOVING AVERAGES OF INFANT MORTALITY RATES
IN WISCONSIN AND IN THE ORIGINAL REGISTRATION AREA OF THE
UNITED STATES, 1915-1933

             Mortality rates           Seven-year moving averages
Year      Wisconsin   Registration     Wisconsin   Registration
                      area                         area
1915         78          100
1916         86          100
1917         78           96
1918         79          106              78            94
1919         79           89              77            91
1920         77           90              75            88
1921         72           79              73            85
1922         70           79              71            80
1923         70           79              70            78
1924         64           72              67            75
1925         67           74              66            73
1926         67           75              64            71
1927         60           64              62            68
1928         61           67              61            67
1929         60           65              58            64
1930         56           62              55            61
1931         53           60
1932         50           55
1933         48           53


The method is simply to add the first seven values of the series,
and divide by 7. Thus, for the Wisconsin series, we have

$$(78 + 86 + 78 + 79 + 79 + 77 + 72)/7 = 549/7 = 78.4.$$

Then the first value in the table, 78, is dropped, and the eighth value,
70, is added, and again the sum is divided by 7:

$$(549 - 78 + 70)/7 = 541/7 = 77.3;$$

and so on.

Notice that a disadvantage of the moving average is that it
reduces the length of the series by one less than the number of
years averaged, or in this case 7 - 1 = 6 years. When the
moving averages are plotted as large dots in Fig. 55, it is
seen that they give trend lines that agree very closely with the
straight lines of best fit, especially in the case of the Wisconsin
data.

It is helpful in selecting a secular trend to note that “If the
actual data fall consistently above or below a line of trend for a
considerable period, it is probable that the fit is not good.”¹
This is not the case in Fig. 55.
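The drop-one, add-one procedure for the seven-year moving average can be sketched as follows. The function name `moving_average` is illustrative; the data are the Wisconsin rates for 1915 through 1933.

```python
def moving_average(values, window=7):
    """Moving averages of `window` consecutive values.

    The result is shorter than the input by window - 1 values; each
    average belongs to the middle year of its window.
    """
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        # Drop the oldest value and add the next one, as described in the text.
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages

# Wisconsin infant deaths per 1,000 live births, 1915-1933.
wisconsin = [78, 86, 78, 79, 79, 77, 72, 70, 70, 64,
             67, 67, 60, 61, 60, 56, 53, 50, 48]

ma = moving_average(wisconsin)
first_center_year = 1915 + 7 // 2   # the first average is centered at 1918

print(first_center_year, round(ma[0], 1), round(ma[1], 1), len(ma))
```

With 19 years and a window of 7, only 13 averages result, centered at 1918 through 1930, which is the loss of six years noted above.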

¹ F. C. Mills, Statistical Methods, p. 290, Henry Holt and Company, Inc.,
New York, 1924.

3. Short-term Cycles.—The cycles in the Wisconsin and
registration area series may be shown more clearly than in Fig. 55
by expressing the original rates as percentages of the trend, using
for the latter either the values lying on the straight lines of best
fit or the moving averages just found. If we choose the former,
the results are shown in the last two columns of Table 81. Thus,
from Table 80, for 1915, we have 78, and from Table 81, 84.01, so
that 100(78/84.01) = 92.85. Any cyclical tendencies in these
percentages of trend will stand out even more if we subtract
100 per cent from each of them, thus expressing them as positive
and minus deviations. This is done in Table 82,¹ and the
resulting cyclical deviations are plotted in Fig. 56.

Fig. 56.—Infant mortality rates in Wisconsin and in the original registration
area of the United States, 1915-1933: cyclical deviations from linear trends.
(From Table 82.)

From Fig. 56 it appears that only short and erratic cycles 
occur in infant mortality rates in Wisconsin and in the original 


TABLE 81.—INFANT MORTALITY RATES IN WISCONSIN AND IN THE ORIGINAL
REGISTRATION AREA OF THE UNITED STATES, 1915-1933
Straight-line Trend Values and Observed Values as Percentages of the
Trend Values


Linear trend Observed rates as 
values per cent of trend 
Year 
Wisconsin Registration Wisconsin Registration 
area, area 
1915 84.01 101.75 92.85 98.28 
1916 82.12 99.00 104.72 101.01 
1917 80.23 96.25 97.22 99.74 
1918 78.34 93.50 100.84 113.37 
1919 76.45 90.75 103.34 98.07 
1920 74.56 88.00 103.27 102.27 
1921 72.67 85.25 99.08 92.67 
1922 70.78 82.50 98.90 95.76 
1923 68.89 79.75 101.61 99.06 
1924 67.00 77.00 95.52 93.51 
1925 65.11 74.25 102.90 99.66 
1926 63.22 71.50 105.98 104.90
1927 61.33 68.75 97.83 93.09 
1928 59.44 66.00 102.62 101.52 
1929 57.55 63.25 104.26 102.77 
1930 55.66 60.50 100.61 102.48 
1931 53.77 57.75 98.57 103.90 
1932 51.88 55.00 96.38 100.00 
1933 49.99 52.25 96.02 101.44 


registration area over the period 1915 through 1933. Slightly 
different results would have been obtained if the moving average 
instead of the straight line had been used as the index of trend. 
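The conversion of observed rates into percentages of trend and percentage deviations can be sketched as follows, using the rounded Wisconsin trend equation (153a). Variable names are illustrative.

```python
# Wisconsin infant deaths per 1,000 live births, 1915-1933.
wisconsin = [78, 86, 78, 79, 79, 77, 72, 70, 70, 64,
             67, 67, 60, 61, 60, 56, 53, 50, 48]
years = list(range(1915, 1934))

# Rounded trend equation (153a): Y = 67 - 1.89 x, with x = year - 1924.
trend = [67 - 1.89 * (year - 1924) for year in years]

# Observed rate as a percentage of its trend value, and the
# deviation of that percentage from 100.
percent_of_trend = [100 * y / t for y, t in zip(wisconsin, trend)]
deviations = [p - 100 for p in percent_of_trend]

print(round(trend[0], 2), round(percent_of_trend[0], 2), round(deviations[0], 2))
print(round(sum(deviations), 2))
```

The deviations do not sum exactly to zero because the rounded coefficients 67 and 1.89 are used in place of the unrounded least-squares values.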


¹ Notice that the first two columns of Table 82 should each sum to zero.
They fail to do so because we disregarded decimals in the equations of the
lines of best fit.
TABLE 82.—INFANT MORTALITY RATES IN WISCONSIN AND THE ORIGINAL
REGISTRATION AREA OF THE UNITED STATES, 1915-1933
Percentage Deviations from Straight-line Trends

          Percentage deviations
               from trend
Year      Wisconsin   Registration      x'²        y'²       x'y'
            (x')      area (y')
1915       -7.15        -1.72          51.12       2.96      12.30
1916       +4.72        +1.01          22.28       1.02       4.77
1917       -2.78        -0.26           7.73       0.07       0.72
1918       +0.84       +13.37           0.71     178.76      11.23
1919       +3.34        -1.93          11.16       3.72      -6.45
1920       +3.27        +2.27          10.69       5.15       7.42
1921       -0.92        -7.33           0.85      53.73       6.74
1922       -1.10        -4.24           1.21      17.98       4.66
1923       +1.61        -0.94           2.59       0.88      -1.51
1924       -4.48        -6.49          20.07      42.12      29.08
1925       +2.90        -0.34           8.41       0.12      -0.99
1926       +5.98        +4.90          35.76      24.01      29.30
1927       -2.17        -6.91           4.71      47.75      14.99
1928       +2.62        +1.52           6.86       2.31       3.98
1929       +4.26        +2.77          18.15       7.67      11.80
1930       +0.61        +2.48           0.37       6.15       1.51
1931       -1.43        +3.90           2.04      15.21      -5.58
1932       -3.62         0.00          13.10       0.00       0.00
1933       -3.98        +1.44          15.84       2.07      -5.73
Total      +2.52        +3.50         233.65     411.68     118.26

To compare the amounts of fluctuation of the two series
around the line of trend, the percentage deviations of Table 82
are squared and summed, giving for the Wisconsin series,

$$\sigma_x = \sqrt{\frac{\sum x'^2}{N} - \left(\frac{\sum x'}{N}\right)^2} = \sqrt{\frac{233.65}{19} - \left(\frac{2.52}{19}\right)^2} = 3.51,$$

and for the original registration area,

$$\sigma_y = \sqrt{\frac{411.68}{19} - \left(\frac{3.50}{19}\right)^2} = 4.65.$$

We therefore conclude that the original registration area series is
1.32 times as variable as the Wisconsin series. Some of this 
difference is due to the abnormal rates of the war year 1918. 
We would expect such a result, as conditions affecting infant 
health are probably more variable over the whole registration 
area than in the single state of Wisconsin. 

4. Correlation between the Short-term Cycles of Two Time 
Series.—Inspection of Fig. 56 shows that infant mortality rates 
tend to rise and fall together in Wisconsin and in the original 
registration area. This resemblance between the apparently 
erratic fluctuations of the two series may be symptomatic of the 
existence of general factors that produce cycles in infant deaths. 
The point is important enough to test with some care. We may 
ask, just how much relationship is there between the variations 
in infant mortality rates in Wisconsin and in the original registra- 
tion area? To answer this question we need to know the value 
of the coefficient of correlation between the two time series, 
taking the deviations from the trend lines, as given in Table 82, 
instead of from the means of the series. It will be recalled 
that the formula for the Pearsonian coefficient of correlation is 


$$r = \frac{\sum x'y' - N M_x M_y}{N \sigma_x \sigma_y}.$$

Taking the sum of the cross products, $\sum x'y' = 118.26$, from
Table 82, N = 19 years from 1915 to 1933 inclusive, $\sigma_x = 3.50$,
and $\sigma_y = 4.65$, as found above, we have

$$r = \frac{118.26 - 19\left(\frac{2.52}{19}\right)\left(\frac{3.50}{19}\right)}{19(3.50)(4.65)} = \frac{118.26 - 0.46}{19(3.50)(4.65)},$$

$$r^2 = 0.15.$$

So that the relationship between infant mortality rates in
Wisconsin and in the original registration area from year to
year enables us to predict one from a knowledge of the other
only 15 per cent more accurately than if we judged one of the
series from a knowledge of its own mean and variance.
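The standard deviations and the correlation coefficient can be recomputed from the two deviation columns, as in the sketch below. The values are the percentage deviations as read from Table 82 and checked against its column totals; variable names are illustrative.

```python
import math

# Percentage deviations from trend, 1915-1933 (first two columns of Table 82).
wis = [-7.15, 4.72, -2.78, 0.84, 3.34, 3.27, -0.92, -1.10, 1.61, -4.48,
       2.90, 5.98, -2.17, 2.62, 4.26, 0.61, -1.43, -3.62, -3.98]
reg = [-1.72, 1.01, -0.26, 13.37, -1.93, 2.27, -7.33, -4.24, -0.94, -6.49,
       -0.34, 4.90, -6.91, 1.52, 2.77, 2.48, 3.90, 0.00, 1.44]

n = len(wis)
mean_x = sum(wis) / n
mean_y = sum(reg) / n

# Standard deviations about the (nonzero) means of the deviations.
sigma_x = math.sqrt(sum(x * x for x in wis) / n - mean_x ** 2)
sigma_y = math.sqrt(sum(y * y for y in reg) / n - mean_y ** 2)

# Pearsonian coefficient from the sum of cross products.
sum_xy = sum(x * y for x, y in zip(wis, reg))
r = (sum_xy - n * mean_x * mean_y) / (n * sigma_x * sigma_y)

print(round(sigma_x, 2), round(sigma_y, 2), round(r, 2))
```

Carried through without intermediate rounding, this gives a coefficient of about .38, close to the value of .39 used in the discussion, so the interpretation is unaffected.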

Could it be that a correlation coefficient of r = .39 is due to
random accidental correspondence between the cyclical fluctuations
in the two series? Although we are dealing here with two
historical series, we have removed the secular trend, and this is
sometimes regarded as warrant for applying the standard error
to this situation. An inspection of the cycles in Fig. 56, however,
suggests that some correlation between successive years
still remains, so that we can hardly assume that the death
rates in our series, regarded as a sample, are independent of one
another. Under these conditions, the basic assumptions of
simple sampling underlying the standard error formula which is
appropriate in this case, viz., $\epsilon_r = 1/\sqrt{N - 1}$, are violated;
so we are unable to answer the question asked at the beginning
of the paragraph. However, the absence of much correlation
between the Wisconsin rates and the original registration area
rates suggests that the control of infant mortality is primarily
a local problem. This should be further tested by comparing
infant mortality rates in Wisconsin with those in adjoining
states.

It is just as important to avoid the distorting effects of one 
or a few atypical, extreme values in correlating time series as in
other correlation problems (see Chap. X, Table 49). For exam- 
ple, in Fig. 56 it appears that the war year 1918 was decidedly 
abnormal in its infant mortality rate, and the same is to some 
extent true of the depression year, 1933. In those two years 
there is much less agreement than usual between the two series. 
If we are interested primarily in knowing the amount of correla- 
tion between infant death rates in Wisconsin and in the registra- 
tion area in normal years, it is, of course, desirable to omit the 
two atypical years from the computation of the correlation 
coefficient. This would make it necessary to fit a new trend 
line to the remaining 17 years of the series, and find the coefficient 
of correlation between the percentage deviations from it. In 
case we do not want to confine the investigation of the amount 
of association between the two series to “normal” years, which 
are not always easy to define objectively, and yet we do want to 
reduce the influence of the extreme or atypical values, it is 
probably advisable to resort to the coefficient of rank correlation. 
This coefficient, ρ, is calculated from Table 83, and has a value
of .41.

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6(678)}{19(19^2 - 1)} = .41.$$
TABLE 83.—RANK OF OBSERVED RATES AS PER CENT OF TREND

Year      Wisconsin   Registration      D        D²
                      area
1915          1            6           -5        25
1916         18           11            7        49
1917          5            9           -4        16
1918         11           19           -8        64
1919         16            5           11       121
1920         15           14            1         1
1921          9            1            8        64
1922          8            4            4        16
1923         12            7            5        25
1924          2            3           -1         1
1925         14            8            6        36
1926         19           18            1         1
1927          6            2            4        16
1928         13           13            0         0
1929         17           16            1         1
1930         10           15           -5        25
1931          7           17          -10       100
1932          4           10           -6        36
1933          3           12           -9        81
Total                                           678


As expected, the result of using ranks in this case is to increase 
the amount of correlation somewhat. 
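The rank correlation can be sketched by ranking the percentages of trend from Table 81 and applying the formula for ρ. The helper `ranks` is illustrative and assumes there are no tied values, which holds for these two series.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Observed rates as per cent of trend, 1915-1933 (Table 81).
wis_pct = [92.85, 104.72, 97.22, 100.84, 103.34, 103.27, 99.08, 98.90,
           101.61, 95.52, 102.90, 105.98, 97.83, 102.62, 104.26, 100.61,
           98.57, 96.38, 96.02]
reg_pct = [98.28, 101.01, 99.74, 113.37, 98.07, 102.27, 92.67, 95.76,
           99.06, 93.51, 99.66, 104.90, 93.09, 101.52, 102.77, 102.48,
           103.90, 100.00, 101.44]

rank_w = ranks(wis_pct)
rank_r = ranks(reg_pct)

# Sum of squared rank differences, then the rank correlation coefficient.
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_w, rank_r))
n = len(wis_pct)
rho = 1 - 6 * sum_d2 / (n * (n * n - 1))

print(sum_d2, round(rho, 2))
```

Because ranks cap the influence of any single year, the extreme registration area value for 1918 counts only as rank 19 rather than as a deviation of more than 13 per cent.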

It often happens that the correlation of two time series is 
greater if one of them is lagged one or more years, so that the 
cycles correspond more closely. For example, if the marriage 
rate declines sharply, so does the birth rate, but not until about 
a year later. Therefore, to test the relationship between mar- 
riage and birth rates, the latter should be lagged by one year. 
That is, say, the 1930 birth rate should be paired with the 1929 
marriage rate, etc. There is no indication that a lag is needed
in correlating the two series with which we were dealing above. 
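Lagging one series before correlating can be sketched as below. Both series are hypothetical and constructed only for illustration, so that the second repeats the movements of the first one year later; the helper names `pearson_r` and `lagged_pairs` are likewise made up.

```python
import math

def pearson_r(xs, ys):
    """Pearsonian coefficient of correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lagged_pairs(leading, following, lag):
    """Pair leading[t] with following[t + lag]; lag is in years, lag >= 0."""
    if lag == 0:
        return list(leading), list(following)
    return list(leading[:-lag]), list(following[lag:])

# Hypothetical yearly deviations from trend (not data from the text):
# the second series repeats the first with a delay of one year.
marriage = [2.0, -1.5, 3.0, -4.0, 1.0, 0.5, -2.5, 3.5]
birth = [0.0, 2.0, -1.5, 3.0, -4.0, 1.0, 0.5, -2.5]

r_unlagged = pearson_r(*lagged_pairs(marriage, birth, 0))
r_lagged = pearson_r(*lagged_pairs(marriage, birth, 1))

print(round(r_unlagged, 2), round(r_lagged, 2))
```

Each year of lag shortens the paired series by one observation, so a long lag on a short series leaves few pairs on which to base the coefficient.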

5. Seasonal Fluctuations.— Data such as infant mortality 
rates may be obtained by months as well as by years. This 
affords an opportunity to study the seasonal fluctuations in 
infant deaths, i.e., the variations in death rates that are associ- 


TABLE 84.—INFANT MORTALITY RATES BY MONTHS, UNITED STATES
REGISTRATION AREA, 1928-1935*

Columns: (1a) Year; (1b) Month; (2) Infant mortality rate per 1,000 live
births in same month; (3) Monthly trend rates; (4) Observed rate as
per cent of trend, (2) ÷ (3) × 100; (5) Monthly averages of observed rates
as per cent of trend; (6) Seasonal index (Table 87); (7) Cycles, (4) - (6)
1928 Jan. 72.4 Ж = 8.79) 
Feb. 73.2 s — 6.25 
Mar. 74.8 Б = 1.04 
Арг. 75.0 x + 5.65 
May 70.4 . + 5.10 
June 64.2 a + .16 
July 60.8 . — 1.09 
Aug. 60.2 2 . + 1.28 
Sept. 63.4 4 + 2.03 
Oct. 64.3 8 a — 2.32 
Nov. 65.2 2 97.84 — 1.52 
Dec. 81.3 9 108.08 312.31 
1929 Jan 99.1 67.39 147.05 113.35 +33.70 
Feb. 84.8 67.22 126.15 112.20 413.95 
Mar. 74.3 67.07 110.78 108.56 + 2.22 
Apr. 66.1 66.91 98.79 103.39 — 4.60 
ay 63.9 66.76 95.72 97.49 — 1.77 
June 57.8 66.60 86.79 93.60 — 6.81 
July 55.7 . 8 90.10 — 6.28 
Aug. 57.7 87.04 0.00 
Sept. 63.4 91.21 + 4.66 
Oct. 64.9 97.10 + 1.26 
Nov. 59.4 97.84 — 7.59 
Dec 65.2- 108.08 — 8.80 
1930 Jan. 67.8 113.35 — 9.85 
Feb. 69.8 112.20 — 5.41 
Маг. 69.3 108.56 — 2.27 
Арг. 68.2 103.39 + 1.45 
Мау 62.5 97.49 = 1.17 
June 61.4 93.60 + 1.24 
July 59.3 90.10 + 1.72 
Aug. 56.0 87.04 = .12 
Sept 61.7 91.21 | + 4.79 
Oct. 67.1 97.10 + 7.55 
Nov. 63.5 97.84 + 1.44 
Dee. 69.8 108.08 + 1.31 
1931 Jan. 75.3 113.35 + 4.95 
Feb. 74.6 112.20 + 5.30 
Mar. E 108.56 + 2.59 
Apr. Г 103.39 Р 
Мау 4 97.49 = 
June 93.60 = 
say ~ d 
ug. = 
Sept. J% 
Oct. + 
—15.57 
1932 —21.91 
98 —18.90 
— 6.41 
— 5.54 
- 3. 
= 1. 
+ 
= 3. 
-— 9. 
—10. 
í Е + 2. 
73.0 60.08 121.50 +13. 


* From Births, Stillbirths, and Infant Mortality, U. S. Bureau of the Census, annual
publication.
TABLE 84.—INFANT MORTALITY RATES BY MONTHS, UNITED STATES
REGISTRATION AREA, 1928-1935.*—(Continued)

Year   Month   (5) Monthly averages   (6) Seasonal index   (7) Cycles
1933 | Jan. 113.34 113.35 | + 5.48 
Feb. 112.19 112.20 | + 4.75 
Mar. 108.55 108.56 | — 7.74 
Арг. 103.38 103.39 | — 8.70 
May 97.48 97.49 — 5.25 
June 93.59 93.60 | + 1.41 
July 90.09 90.10 | — 2.12 
‘Aug. 87.03 87.04 | — 1.89 
Sept. 91.20 91:21 | + 2.35 
Oct. 97.09 97.10 | + 3.21 
Nov. 83 97.84 | + 1.53 
Dec 108.07 108.08 | — 7.58 
1934 | Jan. 113.34 113.35 | — 8.98 
Feb. 112.19 112.20 | + 2.65 
Маг. 108.55 108.56 | + 8.67 
Арг. 103.38 103.39 i 9.30 
May 97.48 97.49 8.36 
June 93.59 93.00 | +11.50 
July 90.09 90.10 | +11.95 
Aug. 87.03 87.04 | + 4.59 
Sept. 91.20 91.21 | — 140 
Oct. 97.09 97.10 | + 3.50 
Nov. 97.83 97.84 | + 7.10 
Dec. 108.07 108.08 | + 5.14 
1935 | Jan. 113.34 113.35 | + 5.35 
Feb, 112.19 112.20 | + 3.79 
Mar. 108.55 108.56 | + 2.93 
Арг. 103.38 103.39 | + 1.76 
ee 97.48 97.49 | + 5.62 
June 93.59 93.60 F 
July 90.09 90.10 
‘Aug. 87.03 87.04 | — 
Sept. 91.20 91.21 | — 
Oct. 97.09 97.10 | — 
Nov. 97.83 7.84 | + 
Dec. 108.07 108.08 | — 


* From Births, Stillbirths, and Infant Mortality, U. S. Bureau of the Census, annual
publication.


ated with spring, summer, fall, and winter. To do this, we must 
first separate the seasonal fluctuations from the secular trend, 
the short-term cycles, and the random fluctuations, all of which 
appear in the original monthly rates given in col. (2) of Table 84. 
We average the 12 monthly rates in each year in Table 84 to 
obtain annual rates, which are entered in Table 85, and plotted 
in Fig. 57. Inspection of Fig. 57 shows a decline in the infant
mortality rate in five of the seven year-to-year changes, and
suggests that a straight line probably is most appropriate to
represent the secular trend. Table 85 shows the calculations needed to fit a
linear trend to the annual rates by the method of least squares. 
FIG. 57.—Annual infant mortality rates in the registration area of the United
States, 1928-1935. (From Table 85.)


TABLE 85.—VALUES NEEDED FOR FITTING A STRAIGHT LINE TO THE ANNUAL
INFANT MORTALITY RATES IN THE REGISTRATION AREA OF THE
UNITED STATES, 1928-1935


        Infant death   Year code                 Trend     Deviations
Year    rate (Y)       (X')          X'Y         values    from trend
1928     68.767          -3        -206.301      68.388      + .379
1929     67.692          -2        -135.384      66.524      +1.168
1930     64.700          -1        - 64.700      64.660      + .040
1931     61.592           0           0.000      62.796      -1.204
1932     57.700          +1          54.700      60.932      -3.232
1933     58.375          +2         116.750      59.068      - .693
1934     60.242          +3         180.726      57.204      +3.038
1935     55.842          +4         223.368      55.340      + .502

Sum                                 169.159                  - .002
Sum of X'² = 44


Substituting the values found in Table 85 in the normal equations
for determining the constants in the equation of a straight
line, we have

\[
b = \frac{\Sigma X'Y - N M_x M_y}{\Sigma X'^2 - N M_x^2}, \qquad a = M_y - b M_x,
\]
\[
b = \frac{169.159 - 8(0.5)(61.864)}{44 - 8(0.25)} = -1.864,
\]
\[
a = 61.864 - (-1.864)(0.5) = 62.796,
\]

so that

\[
Y_c = a + bX', \qquad Y_c = 62.796 - 1.864X'. \tag{154}
\]
From formula (154) the trend values shown in the next to the
last column of Table 85 are estimated by substituting for X’ its 
successive values taken from the third column of the table. The 
annual trend line is plotted in Fig. 57. The last column of 
Table 85, obtained by subtracting the trend values from the 
observed Y values, is inserted as a check on the arithmetic. 
Its sum is approximately zero, as it should be if the calculations 
are carried far enough. 
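
The fitting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names are hypothetical, and the total of the X'Y column is taken as printed in Table 85 (169.159) so that the constants of formula (154) are reproduced. Recomputing the products from the listed rates gives 57.700 for 1932 rather than the printed 54.700, and therefore a total of 172.159 and a slope of about -1.793; the sketch reports both so the reader can compare.

```python
# Annual infant death rates (Y) from Table 85, with coded years X' = -3 ... +4.
Y = [68.767, 67.692, 64.700, 61.592, 57.700, 58.375, 60.242, 55.842]
X = [-3, -2, -1, 0, 1, 2, 3, 4]

N = len(Y)
M_x = sum(X) / N                       # mean of the year codes, 0.5
M_y = sum(Y) / N                       # mean annual rate, about 61.864
sum_x2 = sum(x * x for x in X)         # 44
sum_xy_printed = 169.159               # total of X'Y as printed in Table 85


def fit(sum_xy):
    """Slope and intercept of the least-squares line from the summary values."""
    slope = (sum_xy - N * M_x * M_y) / (sum_x2 - N * M_x ** 2)
    intercept = M_y - slope * M_x
    return round(slope, 3), round(intercept, 3)


b, a = fit(sum_xy_printed)             # constants of formula (154)

# Trend values and deviations (last two columns of Table 85).
trend = [round(a + b * x, 3) for x in X]
deviations = [round(y - t, 3) for y, t in zip(Y, trend)]

# Same fit with the products recomputed from the listed Y values.
sum_xy_recomputed = sum(x * y for x, y in zip(X, Y))
b_recomputed, a_recomputed = fit(sum_xy_recomputed)

print("b =", b, "a =", a)
print("trend values:", trend)
print("sum of deviations:", round(sum(deviations), 3))
print("recomputed sum X'Y:", round(sum_xy_recomputed, 3), "slope:", b_recomputed)
```

Because any line through the point of means leaves deviations that sum to zero, the check in the last column confirms the intercept but would not by itself reveal a slip in the X'Y column.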


FIG. 58.—Monthly infant mortality rates in the registration area of the United
States, 1928-1935. (From Table 84.)


Since in each year the infant death rate declines on the
average 1.864, in one month the decline is 1.864/12 = 0.1553. In
Table 85 we used average annual rates, which apply to the
middle of a year. The middle of the year falls on June 30.
The average monthly rates, however, apply to the middle of
each month, so that the June rate is centered half a month
before June 30. We, therefore, enter Table 84 at June, 1928, and
add to the annual 1928 trend rate of 68.388 one-half of the
correction factor, 0.1553, so that we have


68.388 + .0777 = 68.4657 


as the June, 1928, monthly trend in col. (3) of Table 84. We
then add 0.1553 accumulatively to this rate for the five preceding
months in 1928, and subtract 0.1553 accumulatively from it for
each subsequent month throughout the eight-year period, which
completes col. (3). The monthly trend line, which is identical
with the annual trend line, and the observed monthly rates from
col. (2) of Table 84, are plotted in Fig. 58. From this graph, it is
seen that in spite of a general downward trend in infant mortality
rates, these rates have fluctuated considerably, so that even in,
say, early 1935 they were much higher than in the middle of 1928.
How much of this variation is due to the season of the year?
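
The construction of col. (3) can be sketched as follows. This is a minimal illustration, not the procedure as printed: the names are hypothetical, and the monthly step is carried unrounded, whereas the text rounds it to 0.1553, so an occasional value may differ from the printed table by one unit in the last place.

```python
annual_decline = 1.864            # size of the slope in formula (154)
annual_trend_1928 = 68.388        # 1928 trend value, centered on June 30
step = annual_decline / 12        # monthly decline, about 0.1553

# The June rate is centered half a month before June 30.
june_1928 = annual_trend_1928 + step / 2

# Monthly trend for all 96 months, keyed by (year, month); January 1928 is i = 0.
trend = {}
for i in range(96):
    year, month = 1928 + i // 12, i % 12 + 1
    months_after_june_1928 = i - 5
    trend[(year, month)] = june_1928 - step * months_after_june_1928

# Column (4): observed rate as a percentage of trend, here for May 1928,
# whose observed rate in col. (2) of Table 84 is 70.4.
pct_of_trend_may_1928 = 70.4 / trend[(1928, 5)] * 100

print(round(june_1928, 4))
print(round(trend[(1929, 6)], 2), round(trend[(1932, 12)], 2))
print(round(pct_of_trend_may_1928, 2))
```

The values for June 1929 and December 1932 agree with the legible entries of col. (3), and the May 1928 percentage agrees with the first entry of the May column in Table 87.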


TABLE 86.—FREQUENCY DISTRIBUTION OF OBSERVED INFANT MORTALITY
RATES EXPRESSED AS PERCENTAGES OF TREND, BY MONTHS, UNITED
STATES REGISTRATION AREA, 1928-1935*


Observed rates, per cent of trend | Jan. | Feb. | Mar. | Apr. | May | June | July | Aug. | Sept. | Oct. | Nov. | Dec.


145-149 7 


140-144 


135-139 


130-134 


125-129 / 


120-124 // 


115-119 | /7/ | /// 

' 110-114 / 

105-109 | a / 
7 


100-104 /// 


95- 99 


=: 2002] 
90-94 | / 


__ 
85- 89 

——— 

80- 84 


* From col. (4), Table 84. 


Column (4) of Table 84 shows the monthly observed rates
expressed as percentages of the monthly secular trend values.
These percentages represent the seasonal variations combined
with the short-term cycles and the random fluctuations, but
with the secular trend eliminated. To remove the short-term
cycles and random fluctuations, it is necessary to average the
percentages for each month over the eight-year period. As an
aid in revealing whether or not a seasonal movement actually
exists, and in choosing the most stable kind of monthly average,
Table 86 is set up. A glance at it shows clearly the presence of a
seasonal pattern in infant mortality. The death rate is high in
the winter and low in the late summer. From the arrangement
of the frequencies in the several columns, it appears that, except
possibly in January, the arithmetic mean is a suitable average
to use in this case. As a rule, however, it is recommended to
average the middle three or four values for each month, a sort
of combined mean and median average which avoids the distortion
due to extreme values. In Table 87 the mean monthly
values are found¹ and are entered in col. (5) of Table 84. To
convert the 12 mean monthly percentages to index numbers,
they are divided by their own average, 99.99, and the quotient
multiplied by 100, to give the last row of Table 87 and col. (6)
of Table 84. The indexes of seasonal variation have an advantage
over the simple percentages of col. (5) of Table 84, in that they
vary around a mean of 100.00 per cent, and are therefore
more generally comparable and finished in form. In the seasonal
indexes of col. (6) of Table 84 there now remains only the seasonal
variation, since the secular trend, cycles, and random fluctuations
were removed by the steps just taken. An undistorted
idea of the seasonal variation can now be obtained by plotting
the monthly seasonal indexes around their mean of 100 per cent,
as in Fig. 59. It is again obvious that the winter months are
the danger period for infants.

¹ From col. (4) of Table 84.

FIG. 59.—Seasonal indexes of infant mortality rates in the registration area of the
United States, 1928-1935. (From Table 87.)
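
The conversion of the monthly means to seasonal indexes can be illustrated with a short Python sketch. The twelve means are those of col. (5) of Table 84; the variable names are hypothetical, and the divisor is rounded to two places, as in the text, so that the printed indexes are reproduced.

```python
months = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
          "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]

# Mean percentages of trend for each month, col. (5) of Table 84.
monthly_means = [113.34, 112.19, 108.55, 103.38, 97.48, 93.59,
                 90.09, 87.03, 91.20, 97.09, 97.83, 108.07]

# Average of the twelve means, rounded as in the text (99.99).
grand_mean = round(sum(monthly_means) / len(monthly_means), 2)

# Seasonal index: each mean as a percentage of the grand mean.
seasonal_index = [round(m / grand_mean * 100, 2) for m in monthly_means]

for name, index in zip(months, seasonal_index):
    print(f"{name:<6}{index:7.2f}")
print("mean of indexes:", round(sum(seasonal_index) / 12, 2))
```

With the indexes rounded to two places, their mean comes out at 100.00 to the nearest hundredth rather than exactly 100.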

6. Short-term Cycles Freed from Seasonal Fluctuations.—If
we wish to observe the short-term cycles mixed with random
fluctuations in the monthly infant mortality rates, freed from
both the secular trend and the seasonal movement, this may be
done by recording in col. (7) of Table 84 the differences between
the percentages of trend in col. (4) and the seasonal indexes in
col. (6), and plotting them in Fig. 60. It appears that a number
of other factors besides the season of the year affect the infant
death rate, and need to be studied and brought under control.
There is no suggestion from Fig. 60 that any progress was made
during the eight-year period in reducing the percentage of
infant deaths due to cyclical and random causes. The point
might be tested by obtaining the standard deviations around
zero of the differences in col. (7) of Table 84, for the first two
years and the last two years of the period, and comparing the
two standard deviations.

FIG. 60.—Short-term cycles and random fluctuations in infant mortality rates,
United States registration area, 1928-1935. (From Table 84.) Vertical axis: per
cent of trend; horizontal axis: years, 1928-1935.
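
As an illustration of col. (7) and of the proposed test, the sketch below computes the cycles for the single year 1929 and their standard deviation around zero. It is a minimal example under assumptions: the function name is hypothetical, the 1929 percentages are taken from col. (4) of Table 84 and from Table 87, and the comparison suggested in the text would apply the same function to the first two and the last two years of the period.

```python
import math

# Col. (4) for 1929: observed rates as percentages of trend.
pct_of_trend_1929 = [147.05, 126.15, 110.78, 98.79, 95.72, 86.79,
                     83.82, 87.04, 95.87, 98.36, 90.25, 99.28]

# Col. (6): seasonal indexes, January through December.
seasonal_index = [113.35, 112.20, 108.56, 103.39, 97.49, 93.60,
                  90.10, 87.04, 91.21, 97.10, 97.84, 108.08]

# Col. (7): cycles mixed with random fluctuations, (4) - (6).
cycles_1929 = [round(p - s, 2) for p, s in zip(pct_of_trend_1929, seasonal_index)]


def sd_around_zero(values):
    """Root-mean-square of the values, i.e. standard deviation taken about zero."""
    return math.sqrt(sum(v * v for v in values) / len(values))


print(cycles_1929)
print(round(sd_around_zero(cycles_1929), 2))
```

The large January and February entries dominate the result, which shows how sensitive this measure is to a single exceptional winter.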
TABLE 87.—CALCULATION OF MONTHLY MEANS OF INFANT MORTALITY
RATES EXPRESSED AS PERCENTAGES OF TREND, UNITED STATES
REGISTRATION AREA, 1928-1935


Year               May     June    July    Aug.    Sept.   Oct.    Nov.    Dec.
1928             102.59   93.76   89.01   88.32   93.24   94.78   96.32  120.39
1929              95.72   86.79   83.82   87.04   95.87   98.36   90.25   99.28
1930              96.32   94.84   91.82   86.92   96.00  104.65   99.28  109.39
1931              89.48   84.94   86.26   86.50   93.73   97.99   93.56   92.51
1932              94.49   91.95   90.71   83.86   82.09   86.93  100.28  121.50
1933              92.24   95.01   87.98   85.15   93.56  100.31   99.37  100.50
1934             105.85  105.10  102.05   91.63   90.81  100.60  104.94  113.22
1935             103.11   96.36   89.03   86.55   84.26   93.07   98.65  107.73
Total            779.80  748.75  720.68  696.27  729.56  776.69  782.65  864.52
Mean              97.48   93.59   90.09   87.03   91.20   97.09   97.83  108.07
Seasonal index    97.49   93.60   90.10   87.04   91.21   97.10   97.84  108.08
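
The modified average recommended in the text, the mean of the middle three or four values for each month, can be sketched with the December column of Table 87. The function name and its `keep` parameter are hypothetical; the sketch assumes that an equal number of values is dropped from each end, so `keep` must have the same parity as the number of values.

```python
# December percentages of trend, 1928-1935, from Table 87.
december = [120.39, 99.28, 109.39, 92.51, 121.50, 100.50, 113.22, 107.73]


def modified_mean(values, keep=4):
    """Mean of the `keep` central values after sorting."""
    extra = len(values) - keep
    if extra < 0 or extra % 2 != 0:
        raise ValueError("keep must not exceed, and must match the parity of, the count")
    ordered = sorted(values)
    middle = ordered[extra // 2: extra // 2 + keep]
    return sum(middle) / keep


arithmetic_mean = sum(december) / len(december)
middle_four_mean = modified_mean(december, keep=4)

print(arithmetic_mean, middle_four_mean)
```

For December the two averages differ by less than half a point, which is in keeping with the remark that the arithmetic mean is suitable for these data.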


Exercises 

1. Compare the trends in the birth rates of cities and of rural areas 
in the original registration area of the United States over the 19-year 
period, 1915 through 1933, using the data in the table below. Show 
the cyclical deviations from trend, compare the variability of the two 
series, and calculate the amount of correlation between the fluctuations 
of the two series. What should be done with the data for extremely 
atypical years, such as the war year, 1918? Is the correlation improved 
by “lagging” one of the series? Plot all data. 


BIRTH RATES PER 1,000 POPULATION FOR CITIES AND RURAL AREAS IN THE
ORIGINAL REGISTRATION AREA OF THE UNITED STATES, 1915-1933*


          Birth rate                      Birth rate
Year    Cities    Rural         Year    Cities    Rural
1915     26.0     23.8          1925     22.2     20.3
1916     26.0     23.5          1926     21.5     19.1
1917     26.4     23.3          1927     21.3     19.0
1918     25.8     23.0          1928     20.5     18.1
1919     23.8     [illegible]   1929     19.7     16.8
1920     24.6     22.2          1930     19.3     16.7
1921     24.5     23.1          1931     17.8     16.3
1922     22.9     21.8          1932     17.0     15.5
1923     22.9     21.1          1933     15.8     14.8
1924     23.2     21.2


* From Birth, Stillbirth, and Infant Mortality Statistics, 1935, pp. 5-6, Bureau of the
Census.
2. For the relief data in the accompanying table show the secular
trend and the seasonal fluctuations, and plot the results in each case.


NUMBER OF CASES RECEIVING RELIEF IN 385 RURAL AND TOWN AREAS OF
THE UNITED STATES, 1932-1936*


                                    Cases
Month          1932       1933       1934       1935       1936
January                  99,064    169,554    298,785    145,784
February                107,860    177,041    209,217    146,697
March                   128,704    202,551    290,217    143,000
April                   121,234    216,463    279,901    131,038
May                     112,079    222,647    266,014    123,102
June                    110,158    232,331    244,074    117,808
July                    131,850    239,441    227,814    120,067
August                  126,572    259,410    218,883    128,303
September               114,147    255,929    204,745    129,124
October       38,126    117,459    251,397    201,341    144,492
November      65,922    135,234    262,635    198,780    149,781
December      75,517    115,877    282,068    167,297    166,173


* Adapted from Waller Wynne, Jr., Five Years of Rural Relief, p. 80, WPA, Division of
Social Research, 1938.
Attribute, 28, 231, 234
Average, need for, 94
representativeness of, 108, 130-131
Average deviation, 122-124
(See also Deviation, mean)
Averages, 94
Beta, β, measure of kurtosis, 168
Bias, 32
Bimodal, 95
Binet, Stanford-, intelligence test, 12, 19
Binomial coefficients, 305
Binomial distribution, 151-156
asymmetrical (skewed), 156
formulas for, 151, 152
mean of, formula for the, 155
standard deviation of, formula for the, 155, 234
Campbell, N. R., 23
Caption of a frequency table, 71
Cardinal number, 15
Cards, machine tabulating, 48-49
Causal system, sampling a, 229
Causes, search for, 26
Census, United States Bureau of the,
Charlier check, 127
Chi-square, χ², 304
substitute for standard error of
coefficient of contingency, 217
test applied to a contingency
table, 148-149, 205-206, 208
to a fourfold table, 209
used to test significance of differences
between two frequency
distributions, 269-272
Class intervals, 61
selection of, 64-68
Class limits, continuous variable, 69
discrete variable, 69 
Classes, 61 
Classification, 10 
principles of, 69 
reliability of, 197-198 
Coding, 129 
use of, in computing measures of 
dispersion and partition, 129 
Coefficient of alienation, in linear 
correlation, formulas for, 182, 
183, 190 
Coefficient of contingency, 203-208 
Chi-square as substitute for 
standard error, 217 
computation of, 204-206 
correction for broad grouping, 207 
formulas for, 206 
interpretation of, 208 
sign of, 208 
standard error of, 217 
tabular arrangement for, 204 
Coefficient of correlation, r4, for
fourfold tables, 211
standard error of, 217 
Coefficient of linear correlation, r, 
grouped data, formulas for, 
185 
significance of, 257-258 
significance of the difference 
between two r's, 268-269
values of the correlation coefficient
for different levels of
significance, 306
values of z for given values of r,
307-308 
ungrouped data, formulas for, 
181, 183, 185 
meaning of, 182-184 
size of sample, 182 
Coefficient of regression, linear cor- 
relation, 180 
Coefficient of variation, 129-131 
(See also Variation, coefficient 
of) 
Combinations, 143-144 
formula for, 144 
Comparable measures (scores,
scales), 136-139
percentiles, 137-138 
Q scores, 137 
standard scores, 136 
Concomitant variation, 26 
Confidence limits, 248-249 
(See also Fiducial limits) 
Contingency, coefficient of, 203-208 
(See also Coefficient of contin- 
gency) 
Contingency table, 25 
Continuous variable, 28, 62 
Control group, 26 
Cooperative definition, 44—45 
Coordinates of a point, 82, 172 
Correlation, biserial, 199-203 
formula for r_bis, 201
r_bis compared with r, 203
scatter diagram, 200
sign of r_bis, 203
standard error of r_bis, 217
table, 201 
contingency, 203-208 
(See also Coefficient of con- 
tingency) 
in fourfold tables, 208-217 
(See also Yule's Q, Coefficient 
of correlation, r4, for four- 
fold tables; Tetrachoric 
correlation) 
nonquantitative, 197 
biserial, 199-203 
(See also Biserial correla- 
tion; Correlation,  bi- 
serial) 
choice of method, 198-199 
coefficient of contingency, 203— 
208 
(See also Coefficient of 
contingency; Correlation 
contingency) 
r4, for fourfold tables, 211
tetrachoric correlation, 211-217 
Yule’s Q, 210, 213 
rank, 191 
formula for, 191 


Correlation, simple linear quantitative,
171-196
grouped data, correlation table 
and its explanation, 186-189 
formula for coefficient of 
alienation, 190 
formula for r, 185
formula for standard error of 
estimate, 190 
formula for Y-intercept, 190 
formulas for regression coeffi- 
cient, 190 
ungrouped data, coefficient of 
alienation, k, 182-183 
coefficient of correlation, r,
measuring amount of correlation,
180-184
correlation due to a single
case, 172
does not extend beyond data,
173-174
formulas for r, 181-183
goodness of fit and standard 
error of estimate, 177-180 
line of regression, 175-180, 
184, 185 
negative, 174-175 
normal equations, 175-176 
positive, 174 
regression coefficient, 180 
scatter diagram, 171-174 
tetrachoric, 211-217 
(See also Tetrachoric correla- 
tion) 
between time series, 286-288 
Cottrell, L. J., and E. W. Burgess, 20 
Counting, 10 
Cowden, D. J., and F. E. Croxton, 
23, 93, 121, 142, 170, 182, 195, 
254, 297 
Critical ratio, 258 
Crosshatching, 90-91 
Croxton, F. E., and D. J. Cowden, 
23, 93, 121, 142, 170, 182, 195, 
254, 297 
Culver, Dorothy C., 34 
Cumulative frequency curve (ogive), 
79-81 


Curve, of error, 157 
(See also Normal curve) 
of probabilities, 157 
(See also Normal curve) 
Cycles, correlation between, in two 
series, 286-288 
short-term, 283-286 
short-term, freed from seasonal 
fluctuations, 295-296 
in time series, 286 


D 


Dampier-Whetham, W. C. D., 9 
Davenport, C. B., and M. P. 
Ekas, 217, 220 
Davies, G. R., and Dale Yoder, 
142, 195, 207 
Deciles, 131, 134 
Definition, 10, 44—45 
Degrees of freedom, 148-149 
Delta, Δ, 95
Deviation, mean or average, 122-124 
formula for, 123 
measures of, 122-142 
from an average, 122 
use of coding in computation, 
129 
quartile, 135-136 
(See also Quartile deviation) 
standard, σ, 124-129
computation, grouped data, long 
method, 127 
short method, 127-128 
ungrouped data, long method, 
126 
short method, 126 
formula for, combined distribu- 
tions, 128-129 
Sheppard's correction, 128 
ungrouped and grouped data, 
124-125 
Dewey, John, 30 
Dichotomy, 24, 208 
Differences, between any two statistics,
259-260
significance of sampling, 255-275 
Differences, between statistics from
more than two samples, 272-273 
Discrete aggregate, 18 
Discrete variable, 62 
Dispersion (see Deviation) 
Distribution, 232 
sampling, 232 
(See also Frequency distribu- 
tion) 
Districts, 243 
standard errors of sampling, 243— 
244, 246 
Durost, W. N., and Helen M. 
Walker, 75 


E 


Editing the statistical schedule, 47 
Ekas, M. P., and C. B. Davenport, 
217, 220 
Elderton, W. P., 220 
Elmer, M. C., 55 
Empirical standard error, 232 
Equally likely events, 149 
Error, accumulative, 52-53 
curve of, 157 
(See also Normal curve) 
of observation (record), 50-54 
probable, 161, 232 
(See also Probable error) 
in a ratio, 53 
relative, 52 
standard, 161, 217, 232-249 
(See also Standard error) 
Errors, biased, 50, 53 
unbiased (compensating), 52 
Event, 145, 221, 243-244, 246 
Existent universe, 222 
Expected value, 221 
Experimental group, 26 
Exponent, 109 
Ezekiel, Mordecai, 29, 182, 183, 195 


F 
Factor control, 24 


Failure (unsuccessful event), 149, 
222 
Farm, definition of a, U. S. Census
of Agriculture, 1935, 33-36 
Federal agencies as sources of 

statistical data, 33 
Fiducial limits, 248-249 
(See also Confidence limits) 
Final test, 28 
Fine, H. B., 170 
Fisher, R. A., 30, 182, 195, 306 
Fourfold tables, correlation in, 208— 
217 
Fourth moment, 165 
Freedom, degrees of, 148, 149 
(See also Degrees of freedom) 
Frequencies, 60 
Frequency, 235-239
standard error of simple sampling 
of a, 235-239 
of stratified sampling of a, 237 
Frequency array, 60-61, 66 
Frequency distribution, 60-69, 71— 
72, 107-108, 269-272 
continuous variable, tabulation of, 
68-69 
discrete variable, tabulation of, 
60-68 
rules of table form, 71-72 
shapes of, 107-108 
significance of the difference be- 
tween two or more, 269-272 
Frequency distributions, nonquanti- 
tative variable, tabulation of, 
69-70 
Frequency polygon, 76-79 
Fry, C. Luther, 55 
“Fundamental interval," in social 
measurement, 19 


G 


g1, index of skewness, 165
formula for, 165 
significance of, 258-259 

(See also Skewness) 

g2, index of kurtosis, 165
formula for, 165 
significance of, 258-259 

(See also Kurtosis) 
Galton, Sir Francis, 3
Garrett, H. E., 75, 142, 195
Gaussian curve, 157 
(See also Normal curve) 
Geometric mean, 109-113 
applied to population growth, 
111-113 
formulas for, 109-110 
Gevorkiantz, S. R., and B. D. 
Mudgett, 231 
Giddings, F. H., 9 
Good, C. V., A. S. Barr, and D. E. 
Scates, 30 
Goodness of fit of regression line, 
177-178 
Goulden, C. H., 30 
Graphs, 76 
maps, 90-91 
misuse of, 86 
pictographs, 90-91 
pie chart, 89 
steepness of a line, meaning of, 116 
three-dimension, 89 
(See also Bar charts; Cumula- 
tive curve (ogive); Histo- 
gram; Lorenz curve; Polygon; 
Population growth graphs; 
Semilogarithmic graph; 
Smoothed curve) 
Gross reproduction rate, 116-117 
Grouping errors, 128 
Groups of events, 243 
standard errors of sampling, 243- 
244, 246 
Guilford, J. P., 18


H 


Heterogeneous universe, 223, 246 
Histogram, 76-79 
Holzinger, Karl J., 18, 170, 217, 220 
Homogeneous universe, 223 
Hooton, A. E., 219 
Horst, Paul, 137 
Hypothesis, 32 

null, 154 
Hypothetical universe, 222, 225 


Independent events, 260 
Index, 15, 16, 44, 45 
Individual, the, and statistics, 7 
Infinite universe, 222 
Instructions accompanying a statistical
schedule, 39-40

Intangibles, measurement of, 18-20 
Intercept on the Y axis, 175, 190
Interfering variables, 29 
Interpretation of statistical results, 7 
Interquartile range, 136 

graph of, 136 
Interviewer, the statistical, 42, 47 


J 


J-type distribution, 107-108 

Jocher, Katherine, and Howard W. 
Odum, 55 

Johnson, H. M., 23

Judges, use of, in social measure- 
ment, 15, 16 


K 


Karsten, K. G., 93 
Kelley, T. L., 142 
Kendall, M. G., and G. U. Yule, 
24, 75, 121, 170, 175, 182, 196, 
220, 254 
King, W. I., 105 
Kirkpatrick, Clifford, 23 
Kuhlman, A. F., 34 
Kurtosis, 165-168 
formula for, 165 
g2, index of, 165


L 


Laboratory sciences, 5 
Leptokurtic, 165 
Less-than cumulative frequency
curve (ogive), 79-81
Levels of significance, 256-257
Lexis sample, 231
Limited universe, 222
correction of standard error for, 
242-243 
Lindquist, E. F., 142, 195 
Line of regression, 175-180 
(See also Regression, line of) 
Linear correlation, 171-196 
(See also Correlation) 
Logarithms, 323 
five-place, 323-342 
Lorenz curve, 82-83 
Lundberg, G. A., 9, 23, 55


M 


McCormick, Thomas C., 50, 212 
Maps, 90-91 
Marriage, predicting success or
failure in, 20
Matching, 26 
Mathematical statistics, 3, 5 
Mean, arithmetic, 100 
characteristics and interpretation, 
104-109 
definition of, 100 
grouped data, equal classes, short 
method, 101-103 
long method, 100 
unequal classes, short method, 
103-104 
significance of the difference be- 
tween two means, 264-266 
standard error of simple sampling 
of the, 239-240 
of stratified sampling of the, 240 
of two distributions combined, 63, 
104 
ungrouped data, 99 
weighted, 63, 104 
Mean, geometric, 109-113 
(See also Geometric mean) 
Mean deviation, 122-124 
(See also Deviation, mean) 
Mean probability, 230 
Measurement, of amount, 11 
rules of, 21-22 
Mechanical method, statistics not a, 
8 


Mechanical tabulation of statistical
data, 48-50 
Median, 97 
characteristics and interpretation 
of, 104-109 
definition of, 97 
grouped data, 97-99 
ungrouped data, 96-97 
Merrill, Maud A., and Lewis M.
Terman, 23
Merton, R. K., 16 
Mesokurtic, 165 
Mid-point, 62-64 
Mills, F. C., 9, 75, 142, 196, 254, 
283, 297 
Mode, 94-95 
bimodal distribution, 95 
characteristics and interpretation,
104-109
definition, 94
formula for, 95
Moments, 165-166
Mu, μ, 165
Mudgett, B. D., 75, 93
and S. R. Gevorkiantz, 231
Mutually exclusive events, 146 

N 
National Unemployment Census of 
1937, 39-42 

Negative correlation, 174-175 
Net reproduction rate, 116-117


Nonquantitative methods, role of, 31 
Nonquantitative variable, defined, 
69 
tabulation of, 69-70 
Normal distribution (curve), 156— 
168 
approximation of symmetrical binomial,
156-157
areas and ordinates of, 299-308
calculation of ordinates of, 158
formulas for, 157
graphs of, 156
table showing a, 159
use in determining probabilities,
160-163
Normal equations, straight line,
175-176 

Normalization, 137 

Nu, ν, 165

Null hypothesis, 154 


O


Odum, Howard W., and Katherine 
Jocher, 55 

Ogburn, W. F., 9

Ogive, 79-81 

Ordered data, 11 

Ordinal number, 15 

Ordinate, 158 

Origins of statistics, 3 


P 


Palmer, Vivien M., 55 
Parameter, 221 
definition of, 221, 231 
Parent, synonym for universe, 221 
Partition values, 131-136 
decile (see Decile) 
median (see Median) 
percentile (see Percentile) 
quartile (see Quartile) 
Pearson, Karl, 3, 217 
Percentile, 131-134, 136 
formula for, 188 
Percentile rank, 134-136 
formula for, 135 
Permutations, 143-144 
formula for, 143 
Peters, C. C., and W. R. Van Voor- 
his, 30, 207, 217, 220, 254 
Pictographs, 90-91 
Pie chart, 89 
Platykurtic, 165 
Poisson (stratified) sample, 224,
230-231, 234
Polygon, frequency, 76-79
Population, synonym for universe, 
221 
Population growth, 82, 111 
estimates of, 111-113 
graphs of, 82-87 


Population rates, 114-117 
gross reproduction rate, 116—117 
meaning of, 114-116 
net reproduction rate, 116-117 
standard error of, 244-246 
Positive correlation, 174 
Prediction of a mean vs. individual 
values, 250 
Pretest, 28 
Primary statistical data, 37 
Probabilities, curve of, 157 
(See also Normal curve) 
Probability, 145-151 
addition theorem, 146 
definition of, 145 
mean, 230 
product theorem, 147 
of r successes in n trials, formula
for, 150
Probable error, 161, 232 
Problem in statistical inquiry, 31 
Proportion, 238 
standard error of simple sampling 
of, 238-239 
of stratified sampling of, 239
Proportional sample, 230 
Proportions, 266 
significance of the difference be- 
tween two, 266-268 
Punching machine, 49 


Q 


Q, Yule’s coefficient of correlation 
for fourfold tables, 210 

Q scores, 137 

Qualitative data, 197 

Quality, 4 

Quantification of social data, 10-23 

Quantity, 4 

Quartile deviation, 135-136 

formula for, 136 

Quartiles, 131-134, 136-137 

Questionnaire, 37 

Quetelet, 3 


R 


Random, 224 
Random sample, 224—225 
Random sampling numbers, 226-228
Randomization, principle of, 27-28 
Range, 60 
Rank correlation, 191-192 
formula for, 191 
in time series analysis, 287 
Ranking, 11 
Rates, 109-110, 113 
Rating, 11, 45 
Ratio, 53, 109 
Recurrent universe, 222 
Regression coefficient, linear cor- 
relation, 180, 190 
Regression equations, linear cor- 
relation, 175-176 
error in predicting a mean vs.
individual values, 250 
formula for, when r is known 
184-185 
formulas for, 175, 176 
geometric meaning of, 175 
goodness of fit and standard error 
of estimate, 177-180 
normal equations, 175-176 
use of, for prediction, 179 
Relationship (gross) between two 
factors: nonquantitative cor- 
relation, 197 
(See also Correlation, nonquan- 
titative) 
Relationship (gross) between two 
factors: simple linear quanti- 
tative correlation, 171-196 
(See also Correlation, linear) 
Reliability, 20, 42-43 
Repeated trials, 151 
Replication, 27 
Representative data, 6 
sample, 246 
Representativeness of an average, 
108, 130-131 
of a sample, 250-252 
Rice, Stuart A., 9 
Richardson, C. H., 18, 170 
Rider, P. R., 148 


Root-mean-square deviation, 124-129
(See also Deviation, standard) 
“Rounding off,” 53 
Ruling of a frequency table, 72 


S


Sample, 6 
Bernoulli, 224 
large, 234 
Lexis, 231 
Poisson (stratified), 224, 230-231, 
234 
proportional, 230 
random, 224—228, 255-256 
representative, 246, 250-252 
simple, 224-226, 229-231, 234, 256 
size of, in relation to standard 
error, 234, 237 
stratified (Poisson), 224, 230-231, 
234 
taking the, 224-232 
Sampling, 221, 224-232 
confidence (fiducial) limits, 248 
by groups of events, 228-229 
general theory of, 232-234 
random sampling numbers, 226- 
228 
unit of, 243 
Sampling differences, 255-275 
(See also Significance) 
Sampling distribution, 232 
Sampling errors, 234 
simple sampling errors applied to 
random and stratified sam- 
ples, 234 
(See also Standard error) 
Scale, the, 14
Chapin’s socioeconomic, 12, 20 
graphic rating, 14 
Thurstone’s attitude, 15-17 
Scates, Douglas, 23, 30 
Scatter diagram, 171-174, 188 
Schedule, editing, 48 
the statistical, 37-40 
testing, 42-47 
Scores, 12 


Scoring, 12
Seasonal fluctuations in time series, 
288-295 
Second moment, 165 
Secondary statistical data, 33-36 
Secular trend, 277-283
Semi-interquartile range, 135-136 
(See also Quartile deviation) 
Semilogarithmic paper, 84-85, 88 
Sheppard's correction, 128 
Sigma, Σ, σ, 99, 124
Significance of a correlation coeffi- 
cient, 257-258 
of the difference between any two 
correlated statistics, 259-260 
of the difference between any two 
independent statistics, 260 
of the difference between the 
combined mean of two simple 
samples from the same uni- 
verse and the mean of either 
one of the samples, 266 
of the difference between the 
means of two samples sup- 
posed to be simple samples 
from the same universe, 264- 
265 
of the difference between the 
means of two simple samples 
from different universes, 265-266
of the difference between statis- 
tics from more than two 
samples, 272-273 
of the difference between two cor- 
related means, 261-263 
of the difference between two cor- 
relation coefficients, 268-269 
of the difference between two 
independent means, 263-264
of the difference between two or 
more frequency distributions, 
269-272 
of the difference between two 
proportions, 266-268 
of g1 and g2, 258-259
levels of, 256-259 
meaning of tests of, 255-257 


Significance of sampling differences, 
255-275 
of a sum, 269 
Significant figures, number of, 53 
Simple sample, 224-226, 229-231 
error of sampling applied to 
random and stratified sam- 
ples, 234 
Simple sampling, 269 
test of the hypothesis of, 269 
Simplicity the statistical ideal, 8 
Size of sample, 234, 237, 246-249 
Skewed frequency distribution, 107 
binomial, 156 
formulas for, 164, 165 
geometric mean of, 110 
graphs of, 107, 164 
meaning of the standard deviation 
or standard error of, 161 
representativeness of averages of, 
107, 108 
table showing a, 164 
(See also g1, index of skewness)
Slope of line, 175 
Smith, James G., 9, 170 
Smoothed frequencies or curve, 79 
Snedecor, G. W., 29 
Social sciences, 4, 5, 6 
Social statistics, 3 
Socioeconomic status, Chapin’s scale 
for measuring, 12 
Sociological journals, 32 
Sorenson, H., 75, 142 
Sorting machine, electric, 49-50 
Squares and square roots, 309-322 
Standard deviation, σ, 124-129
(See also Deviation, standard)
Standard error, 161 
of arithmetic mean, 239-240 
controlled by size of sample, 
246-249 
corrected for limited universe, 
242-243 
of a frequency, 235-237
of a population rate, 244-246
in predicting a mean vs. individual 
values from a regression equation,
250
Standard error of a proportion, 238-239
of biserial r, 217
of coefficient of contingency, C, 
217 
empirical, 232 
of standard deviation, 241 
stratified or Poisson sampling, 
of arithmetic mean, 240 
of a frequency, 237 
of a proportion, 239 
of tetrachoric r, 217
theoretical, 232 
when unit of sampling is a group
of events or a district, 243— 
244, 246 
of Yule's Q, 217 
Standard error of estimate, linear 
correlation, 177-180 
formulas for, 178, 190 
meaning of, 179 
Standard scores, 136 
Stanford-Binet intelligence test, 12, 
19 
Statistic, definition of, 221 
true, 136 
Statistics, and the individual, 7 
the method of probabilities, 4 
origins of, 3 
social, 3 
Statistics not a mechanical method, 8
Steepness of a line graph, meaning 
of, 116 
Straight-line relationship, 19, 277-281, 291
Stratified sample, 224 
errors of simple sampling applied 
to, 234 
universe, 246 
Stub of a frequency table, 71 
Success, i.e., successful event, 149, 
222 
Sum, significance of a, 269 
Summation, 99 
Symmetrical frequency distribution,
graph of, 106
representativeness of average
of, 106-108
Symonds, P. M., 18 


T 


Tables, caption, 71 
rules of form for frequency, 71-72 
ruling, 72 
statistical, 41-42
stub, 71 
title, 71 
Tabulation of frequency distribu- 
tions, hand methods, 59-75 
of statistical data, mechanical 
methods, 48-50 
Tabulating machine, electric, 50 
Tallying, 60 
Terman, Lewis M., and Maud A. 
Merrill, 23 
Test, final, 28 
Tetrachoric correlation, 211-217 
computing diagrams for, 215-217 
formulas for, 212 
standard error of, 217
Theoretical standard error, 232
Thermometer, 18, 19
Third moment, 165 
Thorndike, E. L., 45 
Three-dimension graphs, 89 
Thurstone, L. L., attitude scale, 
15-17 
computing diagrams for the tetrachoric
correlation coefficient, 215-216
“Fundamentals of Statistics,” 196 
Time series, analysis, 276 
correlation between short-term 
cycles of two time series, 
286-288 
graphs of, 82-87, 277 
seasonal fluctuations, 288-296 
secular trend, a moving average, 
281-283 
straight line, 277-281 
short-term cycles, 283-286 
freed from seasonal fluctuations, 
295-296 
Tippett, L. H. C., 170
random sampling numbers, 226-228, 254

Title of a frequency table, 71 
Transcription sheet, statistical, 41 
Treloar, A. E., 148, 170, 254 
Trend, 277
secular, 277-283


U 


Unit of sampling, 221, 243-244, 246 
Units, equality of, in social measure- 
ment, 12, 14, 15, 18, 19, 21, 59 
Universe, 136, 221 
binomial, 233 
changing, 222 
existent, 222, 226 
heterogeneous, 223, 229, 246 
homogeneous, 223, 229, 234 
hypothetical, 222, 225-226, 229- 
230, 234 
infinite, 222, 228-229 
limited, 222, 226, 228, 242-243 
mixed, 246 
recurrent, 222 
stratified, 246 
unique, 222 
Unordered data, 10 


V


Validity, 20, 42-46 
Van Voorhis, W. R., and C. C.
Peters, 30, 207, 217, 220, 254
Variable, 62, 231
continuous, 28, 62
discrete, 62
Variables, interfering, 28
Variance, 128
Variation, coefficient of, 129-131 
for comparing variation, 131 
formulas of, 130 
as a measure of the representativeness
of an average, 130-131
need for, 129-130 


W


Walker, Helen M., 9 
and W. N. Durost, 75 
Waugh, A. E., 297 
Weighted arithmetic mean, 63, 104 
Weighting, 12 
Whelpton, P. K., 32 
White, R. C., 75, 142, 196, 297 
Wolf, A., 30 


X


X axis, 77
χ² (see Chi-square)


Y


Y axis, 77
Y-intercept, 175, 190
Young, Pauline V., 55
Yule, G. U., and M. G. Kendall, 24,
75, 121, 170, 175, 182, 196,
220, 254
Yule's Q, coefficient of correlation
for fourfold tables, 210
standard error of, 217


Z


z, values of, for given values of r,
307-308
Zero point on scale, 12, 14, 15, 19, 22